Neural Language Models for Nineteenth-Century English

K. Hosseini; K. Beelen; G. Colavizza; M. Coll Ardanuy

doi:https://doi.org/10.48550/arXiv.2105.11321

Authors	K. Hosseini; K. Beelen; G. Colavizza; M. Coll Ardanuy
Publication date	27-09-2021
Journal	Journal of Open Humanities Data
Article number	22
Volume | Issue number	7
Number of pages	6
Organisations	Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract	We present four types of neural language models trained on a large historical dataset of books in English, published between 1760 and 1900, and comprised of ≈5.1 billion tokens. The language model architectures include word type embeddings (word2vec and fastText) and contextualized models (BERT and Flair). For each architecture, we trained a model instance using the whole dataset. Additionally, we trained separate instances on text published before 1850 for the type embeddings, and four instances considering different time slices for BERT. Our models have already been used in various downstream tasks where they consistently improved performance. In this paper, we describe how the models have been created and outline their reuse potential.
Document type	Article
Note	Data paper.
Language	English
Related dataset	Neural Language Models for Nineteenth-Century English (dataset; language model zoo)
Published at	https://doi.org/10.48550/arXiv.2105.11321 (Submitted manuscript); https://doi.org/10.5334/johd.48 (Final published version)
Other links	https://doi.org/10.5281/zenodo.4782245; https://github.com/Living-with-machines/histLM

UvA-DARE

Digital Academic Repository
