Quantitative comparative linguistics based on tiny corpora: N-gram language identification of wordlists of known and unknown languages from Amazonia and beyond

Authors
Publication date 2015
Journal Journal of Quantitative Linguistics
Volume | Issue number 22 | 3
Pages (from-to) 202-214
Organisations
  • Faculty of Humanities (FGw) - Amsterdam Institute for Humanities Research (AIHR) - Amsterdam Center for Language and Communication (ACLC)
Abstract
Can an unknown Amazonian language be identified by statistical procedures based on n-gram frequencies if only a short list of words is available and, at the same time, the data available for the potential candidate languages are also limited to relatively short wordlists? In this paper we show that n-gram frequencies (specifically 1-grams and 2-grams) allow us to identify languages reliably on the basis of as few as 20 words, as long as these are transcribed consistently and characteristic monogram and bigram frequencies for these languages have previously been established from consistently transcribed data. If no such consistently transcribed data are available, as is the case in our Amazonian case study, such procedures clearly fail for wordlists of 50 or fewer words. Our study thus contributes to exploring the limits of such automated detection procedures, both in terms of corpus size and transcription quality.
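The general idea of identifying a language from monogram and bigram frequencies can be illustrated with a minimal sketch. This is not the procedure used in the paper, whose details are not given in the abstract. The sketch assumes that each wordlist is reduced to a profile of relative character 1-gram and 2-gram frequencies and that profiles are compared by cosine similarity, which is a choice made here purely for illustration. The function names (`ngram_profile`, `cosine_similarity`, `identify`) and the toy wordlists are hypothetical and do not represent any real language.

```python
from collections import Counter
from math import sqrt


def ngram_profile(words, n_values=(1, 2)):
    """Relative frequencies of character n-grams, counted within each word.

    Assumption: 1-grams and 2-grams are pooled into a single profile and
    normalised together; other weightings are equally possible.
    """
    counts = Counter()
    for word in words:
        for n in n_values:
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles (dicts)."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = sqrt(sum(value * value for value in p.values()))
    norm_q = sqrt(sum(value * value for value in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def identify(unknown_words, reference_wordlists):
    """Return the best-matching reference label and all similarity scores."""
    unknown = ngram_profile(unknown_words)
    scores = {
        label: cosine_similarity(unknown, ngram_profile(words))
        for label, words in reference_wordlists.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Invented toy data, for illustration only (not real languages).
references = {
    "lang_a": ["kata", "taka", "kaka", "tata"],
    "lang_b": ["shum", "mush", "hush", "shush"],
}
best, scores = identify(["takata", "kat"], references)
print(best, scores)
```

Because the profiles are built from raw character strings, the comparison is only meaningful if the unknown wordlist and the reference wordlists use the same transcription conventions, which is the condition the abstract singles out.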
Document type Article
Language English
Published at https://doi.org/10.1080/09296174.2015.1037161