Building Realistic Ground Truth Datasets of Personal Identification Information for Entity Matching

Ifeoluwapo Aribilola, Matteo Catena, Mamoona Asghar, John Breslin, Renaud Delbru

Research output: Chapter in Book or Conference Publication/Proceeding > Conference Publication > peer-review

Abstract

Entity matching (EM) is essential for connecting data across sources, particularly in sensitive domains like human trafficking investigations. However, research faces a critical gap: the lack of realistic gold-standard datasets containing personal identification information (PII). This paper introduces a methodology for creating gold-standard datasets, demonstrated through the development of a representative PII dataset. Our approach combines multiple EM techniques to identify candidate matches, followed by a systematic annotation and validation process. Notably, our findings demonstrate that different techniques identify largely non-overlapping sets of matches, validating the need for our multi-technique methodology. Our approach provides a reproducible template for creating gold-standard datasets in domains where realistic evaluation resources are scarce.
Original language: English
Title of host publication: Availability, Reliability and Security
Editors: Bart Coppens, Bruno Volckaert, Vincent Naessens, Bjorn De Sutter
Place of Publication: Cham
Publisher: Springer Nature Switzerland Cham
Pages: 203-218
Number of pages: 16
ISBN (Print): 978-3-032-00639-4
Publication status: Published - Aug 2025
