Assessing the Impact of OCR Quality on Downstream NLP Tasks

D. van Strien; K. Beelen; M. Coll Ardanuy; K. Hosseini; B. McGillivray; G. Colavizza

doi:https://doi.org/10.5220/0009169004840496

Authors	D. van Strien; K. Beelen; M. Coll Ardanuy; K. Hosseini; B. McGillivray; G. Colavizza
Publication date	2020
Host editors	A. Rocha; L. Steels; J. van den Herik
Book title	ICAART 2020
Book subtitle	proceedings of the 12th International Conference on Agents and Artificial Intelligence : Valletta, Malta, February 22-24, 2020
ISBN	9789897583957
Event	12th International Conference on Agents and Artificial Intelligence, ICAART 2020
Volume | Issue number	1
Pages (from-to)	484-496
Number of pages	13
Publisher	Setúbal: SciTePress
Organisations	Faculty of Humanities (FGw) - Amsterdam Institute for Humanities Research (AIHR) Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract	A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of these errors on natural language processing (NLP) tasks has only been partially studied. We perform a series of extrinsic assessment tasks — sentence segmentation, named entity recognition, dependency parsing, information retrieval, topic modelling and neural language model fine-tuning — using popular, out-of-the-box tools in order to quantify the impact of OCR quality on these tasks. We find a consistent impact resulting from OCR errors on our downstream tasks with some tasks more irredeemably harmed by OCR errors. Based on these results, we offer some preliminary guidelines for working with text produced through OCR.
Document type	Conference contribution
Language	English
Related dataset	Is your OCR good enough? A comprehensive assessment of the impact of OCR quality on downstream tasks
Published at	https://doi.org/10.5220/0009169004840496 (Final published version)
Other links	https://www.scopus.com/pages/publications/85083176287
Downloads	ARTIDIGH_2020_7_CR1 (Final published version)