A semi-automated approach for discovery of politically exposed persons in research data: alternative approaches
Published Wednesday 22 January 2025 at 16:51
In an earlier publication, BROD tech partner Ontotext (ONTO) outlined its initial experiments with a machine learning (ML) pipeline for extracting politically exposed persons (PEPs) from research data. The ONTO team applied a customised multilingual entity linking pipeline that combines the IXA[1] method for entity boundary detection with the mGENRE[2] method for entity disambiguation (hereinafter IXA_mGENRE). While this algorithm performed relatively well on the small sample datasets (around 70% F1 score), the mGENRE stage is very slow, which could render the pipeline unsuitable for annotating large amounts of data. A second round of experiments was therefore conducted to assess how IXA_mGENRE would perform against two other pipelines. The two alternative approaches were:
- BELA[3] (plus the same subsequent person classification step implemented for IXA_mGENRE) – BELA is the first multilingual end-to-end system for entity linking. It is very fast due to its one-pass nature. However, the model does not return the categories of the extracted entities, hence the need to apply the person classifier as an additional step.
- MultiNERD[4] + mGENRE (hereinafter MultiNERD_mGENRE) – MultiNERD is a multilingual, multi-genre and fine-grained dataset for named entity recognition (NER) and disambiguation. Its corpus consists of both Wikipedia and WikiNews articles written in 10 languages (Chinese, Dutch, English, French, German, Italian, Polish, Portuguese, Russian, and Spanish). The mentions are linked to three different knowledge bases (Wikidata, Wikipedia and BabelNet[5]). MultiNERD contains mentions for 15 specific NER categories, one of which is “person”. There is a freely available multilingual NER model trained on MultiNERD, which the ONTO team used as an alternative to the IXA method, thus forming the MultiNERD_mGENRE pipeline. Even though the MultiNERD NER model is not specifically trained for Bulgarian and Romanian, it does return results for these languages, so it might be suitable for PEP extraction. It should be noted that, unlike the other two algorithms, this pipeline does not require subsequent person classification, as the model already detects the “person” category.
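To illustrate how the pipelines differ structurally, the sketch below models their common shape: detect mention boundaries, decide which mentions are people, and pass those to the disambiguation step. The `Mention` record, the `extract_persons` function and the `ner`/`link`/`classify` callables are hypothetical placeholders for this article, not ONTO's actual code or model wrappers.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Mention:
    text: str                          # surface form found in the article
    category: Optional[str] = None     # e.g. "person"; MultiNERD fills this in
    wikidata_id: Optional[str] = None  # None means "not linked"

def extract_persons(article: str,
                    ner: Callable[[str], list],
                    link: Callable[[Mention], Optional[str]],
                    classify: Optional[Callable[[Mention], bool]] = None) -> list:
    """Generic shape of the three pipelines discussed in the text.

    - MultiNERD_mGENRE: the NER model already sets category == "person",
      so no separate classifier is needed.
    - IXA_mGENRE / BELA: no categories are returned, so a separate
      "person"/"no person" classifier must run after extraction.
    """
    mentions = ner(article)
    if classify is None:
        # MultiNERD case: categories come straight from the NER model
        persons = [m for m in mentions if m.category == "person"]
    else:
        # IXA / BELA case: apply the extra person classifier
        persons = [m for m in mentions if classify(m)]
    for m in persons:
        m.wikidata_id = link(m)  # the slow disambiguation step (e.g. mGENRE)
    return persons
```

In a real deployment, `ner` would wrap the boundary-detection model and `link` the mGENRE model; the sketch only fixes the data flow between them.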
For the second round of experiments, the ONTO team worked with a subset of the data provided by both CSD and SNSPA, more specifically the first 20,000 articles in the datasets. In addition, this time the datasets were no longer samples: they contained data gathered for the analysis of BROD key research topics on Schengen membership, Ukrainian military aid, and eurozone entry. The tables below give only a glimpse into how annotations for one and the same document differed across the three pipelines.
Pipeline annotations comparison for “person”, “not linked”: example in Bulgarian
Pipeline annotations comparison for “person”, “not linked”: example in Romanian
The amount of processed data was considerably larger than in the first iteration. For this reason, exhaustive manual evaluation of the results was not feasible. Instead, some statistics were calculated over the results and only partial manual analysis was performed. Based on this analysis, the ONTO research team came to the following conclusions:
- Most entities categorized as “person” by MultiNERD are indeed people. The model does make mistakes, but its error rate appears to be much lower than that of IXA_mGENRE.
- Most of the “linked” people are correctly linked. In some cases, however, “not linked” entities do have a corresponding Wikidata item.
- MultiNERD returns the highest number of “not linked” people. This matters for extracting PEPs, as the assumption is that some of them are not covered in any existing general knowledge base. However, all three pipelines return a high number of “not linked” people for whom a Wikidata item does in fact exist. It is therefore questionable whether MultiNERD returns the highest number of PEPs for whom no knowledge base entry exists. BELA is the only model among those explored in this research that is trained specifically to extract entities from Wikidata; in the other pipelines, the components responsible for entity (boundary) extraction, IXA and MultiNERD, are not trained specifically on Wikidata. Intuitively, these two models would therefore be expected to return a higher number of PEPs not present in Wikidata.
- Comparing the BELA and IXA_mGENRE pipelines, IXA_mGENRE extracts a significantly higher number of entities than BELA, but its percentage of “linked” entities is much lower. The manual analysis established that, compared to BELA, the IXA_mGENRE pipeline introduces more noise and more often extracts spans of text that are not actual entities.
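The lightweight statistics used in place of exhaustive manual evaluation can be sketched as follows; the record layout and the `link_stats` helper are illustrative assumptions for this article, not the team's actual tooling.

```python
def link_stats(annotations: list) -> dict:
    """Count the "person" mentions a pipeline produced and the share of
    them that were linked to a Wikidata item (hypothetical record layout)."""
    persons = [a for a in annotations if a["category"] == "person"]
    linked = sum(1 for a in persons if a.get("wikidata_id") is not None)
    return {
        "persons": len(persons),
        "linked": linked,
        "linked_share": round(linked / len(persons), 2) if persons else 0.0,
    }

# hypothetical output records from one pipeline run
sample = [
    {"category": "person", "wikidata_id": "Q12345"},
    {"category": "person", "wikidata_id": None},      # a "not linked" person
    {"category": "location", "wikidata_id": "Q219"},  # ignored: not a person
]
print(link_stats(sample))  # {'persons': 2, 'linked': 1, 'linked_share': 0.5}
```

Comparing these figures across pipelines is what surfaces observations like IXA_mGENRE's higher entity count but lower linked share.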
With these experimental results in mind, we can summarise the pros and cons of each pipeline:
IXA_mGENRE
- Produces results with satisfactory quality.
- mGENRE is very slow, making the pipeline unsuitable for annotating large amounts of data.
- The “person”/“no person” classification is slow, which could be problematic when processing large amounts of data.
BELA
- BELA is very fast and suitable for processing large amounts of data.
- Produces results with good quality.
- The “person”/“no person” classification is slow, which could be problematic when processing large amounts of data.
MultiNERD_mGENRE
- MultiNERD is slower than BELA but still produces results in a reasonable time and could be used to process large amounts of data.
- MultiNERD returns categories and can be beneficial for extracting other types of data, not only people.
- For PEP extraction, only entities with the category “person” are passed to mGENRE, which makes the whole process faster, as mGENRE has to process much less data than in the IXA_mGENRE pipeline.
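A minimal sketch of why this filtering saves time, using made-up mention counts (the function names and the numbers are hypothetical, chosen only to make the contrast concrete):

```python
def mgenre_calls_ixa(mentions: list) -> int:
    # IXA_mGENRE: the NER step returns no categories, so every detected
    # mention must go through the slow mGENRE disambiguation step
    return len(mentions)

def mgenre_calls_multinerd(mentions: list) -> int:
    # MultiNERD_mGENRE: categories are assigned up front, so only
    # "person" mentions reach mGENRE
    return sum(1 for m in mentions if m["category"] == "person")

mentions = (
    [{"category": "person"}] * 30        # hypothetical: 30 person mentions
    + [{"category": "location"}] * 50    # and 70 non-person mentions
    + [{"category": "organization"}] * 20
)
print(mgenre_calls_ixa(mentions))        # 100
print(mgenre_calls_multinerd(mentions))  # 30
```

With these invented proportions, the slow disambiguation step processes less than a third of the mentions; the actual savings depend on how many non-person entities the data contains.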
In summary, based on the results of the experiments and the subsequent analysis, there is no conclusive answer as to which of the three pipelines is best for PEP extraction. Still, it has become evident that the MultiNERD_mGENRE and BELA pipelines appear better suited to the task than the IXA_mGENRE approach, which is slow and introduces a higher percentage of noise. If needed, these two pipelines could be employed to facilitate BROD research partners’ data analysis activities.
[1] IXA/Cogcomp at SemEval-2023 Task 2: Context-enriched Multilingual Named Entity Recognition Using Knowledge Bases - ACL Anthology
[2] https://arxiv.org/abs/2103.12528
[3] GitHub - facebookresearch/BELA: Bi-encoder entity linking architecture
[4] GitHub - Babelscape/multinerd: Repository for the paper "MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)" (NAACL 2022).
[5] https://babelnet.org/