Abstract
Entity matching (EM) is essential for connecting data across sources, particularly in sensitive domains like human trafficking investigations. However, research faces a critical gap: the lack of realistic gold standard datasets containing personally identifiable information (PII). This paper introduces a methodology for creating gold standard datasets, demonstrated through the development of a representative PII dataset. Our approach combines multiple EM techniques to identify candidate matches, followed by a systematic annotation and validation process. Notably, our findings demonstrate that different techniques identify largely non-overlapping sets of matches, validating the need for our multi-technique methodology. Our approach provides a reproducible template for creating gold standard datasets in domains where realistic evaluation resources are scarce.
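The idea of pooling candidates from several techniques and checking how much they overlap can be illustrated with a minimal Python sketch. It is not the paper's implementation: the technique names, record IDs, and helper functions (`normalize`, `pool_candidates`, `pairwise_jaccard`) are hypothetical, and Jaccard similarity is an assumed overlap measure chosen only for illustration.

```python
from itertools import combinations


def normalize(pair):
    """Order a record-ID pair so (a, b) and (b, a) compare equal."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


def pool_candidates(results):
    """Map each candidate pair to the set of techniques that proposed it."""
    pooled = {}
    for technique, pairs in results.items():
        for pair in pairs:
            pooled.setdefault(normalize(pair), set()).add(technique)
    return pooled


def pairwise_jaccard(results):
    """Jaccard overlap between the candidate sets of every two techniques."""
    sets = {t: {normalize(p) for p in pairs} for t, pairs in results.items()}
    overlap = {}
    for t1, t2 in combinations(sorted(sets), 2):
        union = sets[t1] | sets[t2]
        overlap[(t1, t2)] = len(sets[t1] & sets[t2]) / len(union) if union else 0.0
    return overlap


# Hypothetical toy output of three matchers (made-up record IDs).
results = {
    "exact": [("r1", "r2"), ("r3", "r4")],
    "fuzzy": [("r2", "r1"), ("r5", "r6")],
    "embedding": [("r7", "r8")],
}

pooled = pool_candidates(results)    # union of candidates sent to annotation
overlap = pairwise_jaccard(results)  # low values mean little overlap
```

In a sketch like this, the pooled dictionary would feed the annotation step, and low pairwise Jaccard values would correspond to techniques that surface largely different candidate matches.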
| Original language | English |
|---|---|
| Title of host publication | Availability, Reliability and Security |
| Editors | Bart Coppens, Bruno Volckaert, Vincent Naessens, Bjorn De Sutter |
| Place of Publication | Cham |
| Publisher | Springer Nature Switzerland Cham |
| Pages | 203-218 |
| Number of pages | 16 |
| ISBN (Print) | 978-3-032-00639-4 |
| Publication status | Published - Aug 2025 |