defoe: A Spark-based Toolbox for Analysing Digital Historical Textual Data

Open Access
Authors
  • R. Filgueira
  • M. Jackson
  • A. Roubickova
  • A. Krause
  • R. Ahnert
  • T. Hauswedell
  • J. Nyhan
  • D. Beavan
  • T. Hobson
  • M.C. Ardanuy
  • G. Colavizza
  • J. Hetherington
  • M. Terras
Publication date 2019
Book title IEEE 15th International Conference on eScience
Book subtitle proceedings : 24-27 September 2019, San Diego, California
ISBN
  • 9781728124520
ISBN (electronic)
  • 9781728124513
Event 15th IEEE International Conference on eScience, eScience 2019
Article number 9041813
Pages (from-to) 235-242
Number of pages 8
Publisher Los Alamitos, California: IEEE Computer Society
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
  • Faculty of Humanities (FGw) - Amsterdam Institute for Humanities Research (AIHR)
Abstract

This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It allows text mining queries to be run in parallel, via Apache Spark, across large datasets such as historical newspapers and books. It handles queries against collections that comprise several XML schemas and physical representations. The proposed tool has been successfully evaluated using five different large-scale historical text datasets on two HPC environments, as well as on desktops. Results show that defoe allows researchers to query multiple datasets in parallel from a single command-line interface and in a consistent way, without any HPC environment-specific requirements.

Document type Conference contribution
Language English
Published at https://doi.org/10.1109/eScience.2019.00033
Other links http://www.proceedings.com/53619.html https://www.scopus.com/pages/publications/85083244614
Downloads
09041813 (Final published version)