A Simple but Effective Approach to Improve Arabizi-to-English Statistical Machine Translation

Open Access
Authors
Publication date 2016
Book title WNUT 2016 : the 2nd Workshop on Noisy User-generated Text
Book subtitle proceedings of the Workshop : December 11, 2016, Osaka, Japan
ISBN
  • 9784879747075
Event The 2nd Workshop on Noisy User-generated Text (W-NUT)
Pages (from-to) 43-50
Number of pages 8
Publisher The COLING 2016 Organizing Committee
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
  • Faculty of Science (FNWI)
Abstract
A major challenge for statistical machine translation (SMT) of Arabic-to-English user-generated text is the prevalence of text written in Arabizi, or Romanized Arabic. When facing such texts, a translation system trained on conventional Arabic-English data will suffer from extremely low model coverage. In addition, Arabizi is not regulated by any official standardization and is therefore highly ambiguous, which prevents rule-based approaches from achieving good translation results. In this paper, we improve Arabizi-to-English machine translation by presenting a simple but effective Arabizi-to-Arabic transliteration pipeline that does not require knowledge from experts or native Arabic speakers. We incorporate this pipeline into a phrase-based SMT system, and show that automatically transliterating Arabizi to Arabic yields translation quality comparable to that achieved after human transliteration.
Document type Conference contribution
Language English
Published at http://aclweb.org/anthology/W16-3908
Downloads
W16-3908 (Final published version)