Digital weight watching: reconstruction of scanned documents

Authors
Publication date 2009
Host editors
  • D. Lopresti
  • S. Roy
  • K. Schulz
  • L.V. Subramaniam
Book title AND 2009 : proceedings of the Third Workshop on Analytics for Noisy Unstructured Text Data
Book subtitle July 23-24, 2009, Barcelona, Spain
ISBN (electronic)
  • 9781605584966
Event Third Workshop on Analytics for Noisy Unstructured Text Data (AND 2009), Barcelona, Spain
Pages (from-to) 25-31
Publisher New York, NY: ACM Press
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Scanned and OCRed data leads to large file sizes if facsimile images are included. This makes storing large data sets, and providing online access to them, costly. Manually analyzing such data is cumbersome because of long download and processing times. It may thus be advantageous to reconstruct the scanned documents as documents without scanned images that nevertheless closely resemble the originals. We have done this reconstruction for a data set of Dutch parliamentary proceedings with positive results: only 1.5% of the original storage space was needed, while the documents resembled the originals to a high degree. We describe the reconstruction process and evaluate the costs, the benefits and the quality.
Document type Conference contribution
Language English
Published at https://doi.org/10.1145/1568296.1568303