A semi-automated approach for discovery of politically exposed persons in research data: initial experiments


Published Monday 16 December 2024 at 14:45

[Figure: overview of the processing pipeline]

Comprehensive analysis of textual data for a research task benefits greatly from extracting knowledge about the individuals and organisations mentioned. This is particularly true for studies in the mis/disinformation domain, where accurate data on prominent actors reveals much about the contents of disinformation narratives.

Striving for depth and excellence in their work, BROD research partners CSD (Bulgaria) and SNSPA (Romania) have been looking for a way to at least partially automate the extraction of named entities from the gathered data, and to go beyond that by linking these entities to an external dataset with information about politically exposed persons (PEPs). This additional level of analysis would allow researchers to better study PEPs’ connections and influence on societal processes.

Because of its extensive NLP expertise, BROD technical partner Ontotext (ONTO) has been keen to contribute a solution. Leveraging a state-of-the-art multilingual entity linking pipeline (IXA[1] + mGENRE[2]) that links discovered entities to Wikidata identifiers, the ONTO team carried out an initial round of experiments on sample data provided by CSD. Two important considerations required specific adjustments to the pipeline:
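mGENRE frames entity linking as text generation: each detected mention is wrapped in special [START]/[END] markers, and the model then generates the title of the matching Wikidata entry. A minimal sketch of the mention-marking step (the helper name is our own illustration, not part of the pipeline):

```python
def mark_mention(text: str, start: int, end: int) -> str:
    """Wrap the mention at text[start:end] in mGENRE's [START]/[END] markers."""
    return f"{text[:start]}[START] {text[start:end]} [END]{text[end:]}"

sentence = "Einstein was born in Ulm."
print(mark_mention(sentence, 0, 8))
# [START] Einstein [END] was born in Ulm.
```

The marked sentence is what the model actually consumes; everything downstream (candidate generation, ranking against Wikidata titles) happens inside mGENRE itself.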

  1. Some PEPs are new or less prominent public figures and are therefore not featured in the “Wiki universe.” In the pipeline’s terms, such individuals would be classified as “not linked” and discarded, and important information would be lost. The algorithm has therefore been adjusted to retain “not linked” entities for further analysis.
  2. As the pipeline employed has no module for entity type recognition, a customised classifier model has been added that automatically assigns a “person” or “no person” label to the “not linked” entities.

Let’s illustrate the processing workflow with examples from two sample datasets, in the two languages of the hub (Bulgarian and Romanian). Both datasets contain Facebook posts extracted via CrowdTangle. The first revolves around the “special military operation” in Ukraine and is only in Bulgarian. The second comprises posts published across the Balkan region by the embassies of various EU countries, the USA and Russia. The test data in Romanian is taken from this second dataset.

The figures below exemplify the three main processing steps:

  1. Run the data through the customised pipeline to detect “linked” and “not linked” entities;
  2. Select “not linked” entities for further analysis;
  3. Automatically classify the “not linked” entities as “person” or “no person”.
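The steps above can be sketched as follows, with a toy lookup table standing in for the real IXA + mGENRE linker and a naive heuristic standing in for the customised classifier (both are illustrative placeholders, not the actual pipeline components):

```python
from typing import Optional

# Toy stand-in for the Wikidata knowledge base behind the real linker.
KNOWN_ENTITIES = {"София": "Q472"}  # Sofia's Wikidata QID

def link(mention: str) -> Optional[str]:
    """Step 1: return a Wikidata QID, or None for a 'not linked' entity."""
    return KNOWN_ENTITIES.get(mention)

def is_person(mention: str) -> bool:
    """Step 3 stub: the real pipeline uses a customised classifier model."""
    return len(mention.split()) >= 2  # naive placeholder heuristic

mentions = ["София", "Иван Петров"]                    # "Иван Петров" is a made-up name
linked = {m: link(m) for m in mentions}                # step 1: link entities
not_linked = [m for m in linked if linked[m] is None]  # step 2: keep "not linked"
labels = {m: ("person" if is_person(m) else "no person")
          for m in not_linked}                         # step 3: classify
print(labels)  # {'Иван Петров': 'person'}
```

The point of the sketch is the data flow: linked entities carry a Wikidata identifier onward, while the unlinked remainder is routed to the person/no-person classifier instead of being dropped.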


Pipeline processing steps: example in Bulgarian


Pipeline processing steps: example in Romanian

The last column in the tables at step 3 (“is person”) shows the final phase: the manual evaluation of the results from this first round of experiments, aiming to assess how well the adjusted pipeline solves the task of extracting PEPs. The analysis revealed that, for the datasets processed, “not linked” entities make up around 30% of all extracted entities. The F1 score, a standard performance metric for classification models, is similar across the tested datasets, at around 70%.
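For reference, F1 is the harmonic mean of precision and recall. A minimal computation with illustrative counts (not the project’s actual evaluation figures):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 70 correct "person" predictions, 30 false positives, 30 misses.
print(round(f1_score(tp=70, fp=30, fn=30), 2))  # 0.7
```

With equal precision and recall, as in this toy case, F1 simply equals both; the metric only diverges from accuracy-style measures when the two are imbalanced.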

In conclusion, the evaluation results from these initial experiments have demonstrated that the ONTO team’s customised entity linking pipeline can be beneficial for extracting PEPs from unstructured text. On the other hand, there is still room for improvement towards a higher F1 score. This has prompted ONTO researchers to continue fine-tuning the pipeline and to look for alternative approaches.


[1] IXA/Cogcomp at SemEval-2023 Task 2: Context-enriched Multilingual Named Entity Recognition Using Knowledge Bases - ACL Anthology

[2] Multilingual Autoregressive Entity Linking - https://arxiv.org/abs/2103.12528
