Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention

Open Access
Authors
Publication date 2019
Host editors
  • K. Inui
  • J. Jiang
  • V. Ng
  • X. Wan
Book title 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing
Book subtitle EMNLP-IJCNLP 2019 : proceedings of the conference : November 3-7, 2019, Hong Kong, China
ISBN (electronic)
  • 9781950737901
Event 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing
Pages (from-to) 898-909
Publisher Stroudsburg, PA: The Association for Computational Linguistics
Organisations
  • Faculty of Science (FNWI)
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract
The general trend in NLP is towards increasing model capacity and performance via deeper neural networks. However, simply stacking more layers of the popular Transformer architecture for machine translation leads to poor convergence and high computational overhead. Our empirical analysis suggests that convergence is poor because of gradient vanishing caused by the interaction between residual connections and layer normalization. We propose depth-scaled initialization (DS-Init), which decreases parameter variance at the initialization stage and reduces the output variance of residual connections, easing gradient back-propagation through normalization layers. To address the computational cost, we propose a merged attention sublayer (MAtt), which combines a simplified average-based self-attention sublayer and the encoder-decoder attention sublayer on the decoder side. Results on WMT and IWSLT translation tasks across five translation directions show that deep Transformers with DS-Init and MAtt substantially outperform their base counterparts in BLEU (+1.1 BLEU on average for 12-layer models), while matching the decoding speed of the baseline model thanks to the efficiency improvements of MAtt. Source code for reproduction will be released soon.
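The two ideas in the abstract can be illustrated with a minimal sketch. The first function shows depth-scaled initialization under the assumption that the usual Xavier/Glorot uniform bound is shrunk by a factor of alpha / sqrt(l) for the l-th layer, so deeper layers start with smaller parameter variance; the function name, the `alpha` parameter, and the exact scaling rule are illustrative assumptions, not the paper's verbatim formulation. The second function shows the core of average-based self-attention, where position t attends uniformly to positions 1..t via a running mean (the gating and the merge with encoder-decoder attention used by MAtt are omitted).

```python
import math
import random


def ds_init_matrix(fan_in, fan_out, layer_depth, alpha=1.0):
    """Sketch of depth-scaled initialization (assumed form):
    Xavier/Glorot uniform bound scaled by alpha / sqrt(layer_depth),
    so parameter variance shrinks as the layer sits deeper in the stack."""
    bound = math.sqrt(6.0 / (fan_in + fan_out)) * alpha / math.sqrt(layer_depth)
    return [[random.uniform(-bound, bound) for _ in range(fan_out)]
            for _ in range(fan_in)]


def cumulative_average(seq):
    """Core of average-based self-attention: the output at position t is
    the mean of the inputs at positions 1..t (uniform attention weights),
    computed with a running sum. `seq` is a list of equal-length vectors."""
    out, total = [], [0.0] * len(seq[0])
    for t, x in enumerate(seq, start=1):
        total = [a + b for a, b in zip(total, x)]
        out.append([v / t for v in total])
    return out
```

For example, `cumulative_average([[1.0, 1.0], [3.0, 3.0]])` returns `[[1.0, 1.0], [2.0, 2.0]]`: the second position sees the average of both inputs. Because each position only averages earlier positions, this sublayer needs no per-pair attention scores, which is where MAtt's decoding-speed savings come from.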
Document type Conference contribution
Language English
Published at https://doi.org/10.18653/v1/D19-1083
Other links https://github.com/bzhangGo/zero
Downloads D19-1083 (Final published version)