SimLex-999 for Dutch

Open Access
Authors
Publication date 2024
Host editors
  • N. Calzolari
  • M.-Y. Kan
  • V. Hoste
  • A. Lenci
  • S. Sakti
  • N. Xue
Book title The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Book subtitle main conference proceedings : 20-25 May, 2024, Torino, Italia
ISBN (electronic)
  • 9782493814104
Series COLING
Event 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Pages (from-to) 14832–14845
Publisher ELRA Language Resources Association
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract
Word embeddings revolutionised natural language processing by effectively representing words as dense vectors. Although many datasets exist to evaluate English embeddings, few cater to Dutch. We developed a Dutch variant of the SimLex-999 word similarity dataset by gathering similarity judgements from 235 native Dutch speakers. Subsequently, we evaluated two popular Dutch language models, BERTje and RobBERT, and found that BERTje aligned more closely with human semantic similarity judgements than RobBERT did. This study provides the first intrinsic Dutch word embedding evaluation dataset, which enables accurate assessment of these embeddings and fosters the development of effective Dutch language models.
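The intrinsic evaluation summarised in the abstract can be illustrated with a minimal sketch: score each word pair by the cosine similarity of its two vectors, then compare those scores with the human ratings using Spearman's rank correlation, the metric conventionally reported for SimLex-999-style datasets. Whether the paper follows exactly this procedure, and how word vectors are obtained from BERTje or RobBERT, is not stated in this record and is assumed here. The toy vectors, ratings, and function names below are hypothetical and are not taken from the dataset.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(embeddings, rated_pairs):
    """Correlate model cosine similarities with human ratings."""
    model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in rated_pairs]
    human_scores = [rating for _, _, rating in rated_pairs]
    return spearman(model_scores, human_scores)


# Hypothetical toy data: made-up 2-d vectors and made-up ratings.
toy_embeddings = {
    "snel": [1.0, 0.0],
    "vlug": [0.9, 0.1],
    "oud": [0.0, 1.0],
    "nieuw": [0.7, 0.7],
}
toy_pairs = [
    ("snel", "vlug", 9.0),
    ("oud", "nieuw", 1.5),
    ("snel", "oud", 0.5),
]
rho = evaluate(toy_embeddings, toy_pairs)
```

A higher `rho` means the model's similarity ranking agrees more closely with the human ranking. Comparing two models on the same rated pairs then reduces to comparing their `rho` values.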
Document type Conference contribution
Language English
Published at https://aclanthology.org/2024.lrec-main.1292
Downloads
2024.lrec-main.1292-1 (Final published version)