From shadows to connections: Finding entities in archival data and beyond

Open Access
Award date 07-03-2025
ISBN
  • 9789493431089
Number of pages 119
Organisations
  • Faculty of Humanities (FGw) - Amsterdam Institute for Humanities Research (AIHR) - Amsterdam School for Heritage, Memory and Material Culture (AHM)
  • Faculty of Science (FNWI)
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Finding named entities in historical documents helps to uncover hidden connections. However, entity linking (EL) on archival data comes with challenges, including ambiguity, noise, and historical spelling variations. This thesis examines several of these challenges.
First, we define entity overshadowing, a phenomenon in entity disambiguation (ED) where entity commonness outweighs relevance. We discover that modern ED systems, despite performing well on common benchmarks, fail to detect context changes, which means ED remains an open task. Then, we show that prioritising contextual information improves the results for overshadowed entities without compromising performance on more common ones.
Next, we focus on the obstacles specific to historical data. We show that mention detection, the first step of EL, which we approach as named entity recognition (NER), is affected by OCR noise and spelling variations. Then, we examine three BERT-based models pre-trained on different datasets and fine-tuned for NER on historical Dutch. Our experiments show that historical pre-training is not a prerequisite for good performance.
For handwritten archival data, extra steps are needed before entities can be found. We show that using a single multilingual handwritten text recognition (HTR) model on a multilingual and multi-authored archive is a viable alternative to language-specific models when resources are limited.
Finally, we turn to a case study: analysing the correspondence of Sybren Valkema, a Dutch glass artist and educator. Our method combines text recognition, EL, and social network construction to surface insights for art history research. A comparison with a manually curated network demonstrates the value of EL for uncovering historical social networks.
Document type PhD thesis
Language English