Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token

Open Access
Authors
  • B. Liao
  • D. Thulke
  • S. Hewavitharana
  • H. Ney
Publication date 2022
Host editors
  • Y. Goldberg
  • Z. Kozareva
  • Y. Zhang
Book title Findings of the Association for Computational Linguistics: EMNLP 2022
Book subtitle Conference on Empirical Methods in Natural Language Processing (EMNLP), Abu Dhabi, United Arab Emirates, 7-11 December 2022
Event The 2022 Conference on Empirical Methods in Natural Language Processing
Pages (from-to) 1478–1492
Publisher Stroudsburg, PA: Association for Computational Linguistics
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
The pre-training of masked language models (MLMs) consumes massive amounts of computation to achieve good results on downstream NLP tasks, resulting in a large carbon footprint. In the vanilla MLM, the virtual [MASK] tokens act as placeholders that gather contextualized information from the unmasked tokens in order to restore the corrupted content. This raises the question of whether the [MASK]s can instead be appended at a later layer, reducing the sequence length for the earlier layers and making pre-training more efficient. We show: (1) [MASK]s can indeed be appended at a later layer, being disentangled from the word embedding; (2) gathering contextualized information from the unmasked tokens requires only a few layers. By further increasing the masking rate from 15% to 50%, we can pre-train RoBERTa-base and RoBERTa-large from scratch with only 78% and 68% of the original computational budget, without any degradation on the GLUE benchmark. When pre-training with the original budget, our method outperforms RoBERTa on 6 of the 8 GLUE tasks, by 0.4% on average.
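To make the idea concrete, the following is a minimal, hypothetical PyTorch sketch of the "mask later" scheme described in the abstract: the early layers process only the unmasked tokens, a learned [MASK] vector (kept separate from the word-embedding table) is appended together with its position embeddings before the later layers, and predictions are read off the appended slots. All module and parameter names here are illustrative assumptions, not the authors' implementation; the official code is available at the GitHub link below.

```python
import torch
import torch.nn as nn


class MaskLaterEncoder(nn.Module):
    """Toy encoder that appends [MASK] tokens only before the later layers,
    so the early layers see a shorter, unmasked-only sequence."""

    def __init__(self, vocab_size=32000, d_model=256, n_heads=4,
                 n_early=2, n_late=10, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # A single learned [MASK] vector, disentangled from the word-embedding table.
        self.mask_emb = nn.Parameter(torch.randn(d_model) * 0.02)

        def make_layer():
            return nn.TransformerEncoderLayer(d_model, n_heads,
                                              dim_feedforward=4 * d_model,
                                              batch_first=True)

        self.early = nn.ModuleList([make_layer() for _ in range(n_early)])
        self.late = nn.ModuleList([make_layer() for _ in range(n_late)])
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, unmasked_ids, unmasked_pos, masked_pos):
        # unmasked_ids / unmasked_pos: (B, L_keep)  kept token ids and their positions
        # masked_pos:                  (B, L_mask)  original positions of masked-out tokens
        x = self.tok_emb(unmasked_ids) + self.pos_emb(unmasked_pos)
        for layer in self.early:              # short sequence: unmasked tokens only
            x = layer(x)
        B, L_mask = masked_pos.shape
        m = self.mask_emb.expand(B, L_mask, -1) + self.pos_emb(masked_pos)
        x = torch.cat([x, m], dim=1)          # append [MASK]s before the later layers
        for layer in self.late:               # full sequence: unmasked + [MASK] tokens
            x = layer(x)
        # Predict original ids only at the masked positions (the appended slots).
        return self.lm_head(x[:, -L_mask:, :])


if __name__ == "__main__":
    model = MaskLaterEncoder()
    B, L_keep, L_mask = 2, 6, 6               # e.g. 50% masking on length-12 inputs
    unmasked_ids = torch.randint(0, 32000, (B, L_keep))
    unmasked_pos = torch.arange(0, 12, 2).expand(B, -1)
    masked_pos = torch.arange(1, 12, 2).expand(B, -1)
    logits = model(unmasked_ids, unmasked_pos, masked_pos)
    print(logits.shape)                       # torch.Size([2, 6, 32000])
```

With a 50% masking rate, the early layers in such a sketch run on roughly half the sequence length, which is the source of the compute savings the abstract reports.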
Document type Conference contribution
Note With supplementary video
Language English
Published at
  • https://doi.org/10.48550/arXiv.2211.04898
  • https://doi.org/10.18653/v1/2022.findings-emnlp.106
Other links https://github.com/BaohaoLiao/3ml
Downloads
2022.findings-emnlp.106 (Final published version)