Quantitative comparative linguistics based on tiny corpora: N-gram language identification of wordlists of known and unknown languages from Amazonia and beyond

Authors	F. Seifart, R. Mundry
Publication date	2015
Journal	Journal of Quantitative Linguistics
Volume | Issue number	22 | 3
Pages (from-to)	202-214
Organisations	Faculty of Humanities (FGw) - Amsterdam Institute for Humanities Research (AIHR) - Amsterdam Center for Language and Communication (ACLC)
Abstract	Can an unknown Amazonian language be identified by statistical procedures based on n-gram frequencies if only a short list of words is available and, at the same time, the available data for the potential candidate languages are also limited to relatively short wordlists? In this paper we show that n-gram frequencies (specifically 1-grams and 2-grams) allow us to identify languages reliably based on as few as 20 words, as long as these are transcribed consistently, and as long as characteristic monogram and bigram frequencies for these languages have previously been established based on consistently transcribed data. If no such consistently transcribed data are available, as is the case in our Amazonian case study, such procedures clearly fail for wordlists with 50 or fewer words. Our study thus contributes to exploring the limits of such automated detection procedures, both in terms of corpus size and transcription quality.
Document type	Article
Language	English
Published at	https://doi.org/10.1080/09296174.2015.1037161 (Final published version)

UvA-DARE