Intrinsic evaluation of Mono- and Multilingual Dutch Language Models

Open Access
Authors
Publication date 2025
Journal Computational Linguistics in the Netherlands Journal
Event Computational Linguistics in the Netherlands (CLIN) 34
Volume / Issue number: 14
Pages (from-to) 525-553
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
  • Faculty of Humanities (FGw) - Amsterdam Institute for Humanities Research (AIHR)
Abstract
Through transfer learning, multilingual language models can produce good results on extrinsic, downstream NLP tasks in low-resource languages despite a lack of abundant training data. In most cases, however, monolingual models still perform better. Using the Dutch SimLex-999 dataset, we intrinsically evaluate several pre-trained monolingual stacked encoder LLMs for Dutch and compare them to several multilingual models that support Dutch, including two with parallel architectures (BERTje and mBERT). We also try to improve these models’ semantic representations by tuning the multilingual models on additional Dutch data. Furthermore, we explore the effect of tuning these models on written versus transcribed spoken data. While we can improve multilingual model performance through fine-tuning, we find that significant amounts of fine-tuning data and compute are required to outscore monolingual models on the intrinsic evaluation metric.
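As context for the intrinsic evaluation the abstract describes, the sketch below shows the standard SimLex-style protocol: score word pairs by the cosine similarity of their embeddings, then compute Spearman's rank correlation against human similarity ratings. The embeddings, word pairs, and ratings here are illustrative stand-ins, not the actual models or the Dutch SimLex-999 data used in the paper.

```python
# Hedged sketch of SimLex-style intrinsic evaluation: correlate model
# similarity scores with human ratings using Spearman's rank correlation.
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    """Spearman's rho for lists without tied values (ties would need averaged ranks)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy 2-d "embeddings" (hypothetical; a real evaluation would extract
# representations from a pre-trained model such as BERTje or mBERT).
emb = {
    "hond": [0.9, 0.1], "kat": [0.8, 0.2],
    "auto": [0.1, 0.9], "fiets": [0.3, 0.7],
}
# (word1, word2, human similarity rating) -- invented numbers for illustration.
pairs = [("hond", "kat", 8.5), ("auto", "fiets", 6.0), ("hond", "auto", 1.0)]

model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
gold_scores = [g for _, _, g in pairs]
rho = spearman(model_scores, gold_scores)
```

A higher rho means the model's similarity judgments track human intuitions more closely; this is the intrinsic evaluation metric on which the paper compares monolingual and multilingual models.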
Document type Article
Language English
Published at https://www.clinjournal.org/clinj/article/view/215