On Sparsifying Encoder Outputs in Sequence-to-Sequence Models

Open Access
Authors
  • Biao Zhang
  • Ivan Titov
  • Rico Sennrich
Publication date 2021
Host editors
  • C. Zong
  • F. Xia
  • W. Li
  • R. Navigli
Book title Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
Book subtitle Findings of ACL: ACL-IJCNLP 2021, August 1-6, 2021
ISBN (electronic)
  • 9781954085541
Event The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)
Pages (from-to) 2888-2900
Number of pages 13
Publisher Stroudsburg, PA: The Association for Computational Linguistics
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract

Sequence-to-sequence models usually transfer all encoder outputs to the decoder for generation. In this work, by contrast, we hypothesize that these encoder outputs can be compressed to shorten the sequence delivered for decoding. We take Transformer as the testbed and introduce a layer of stochastic gates in between the encoder and the decoder. The gates are regularized using the expected value of the sparsity-inducing L0 penalty, which results in a subset of encoder outputs being completely masked out. In other words, via joint training, the L0DROP layer forces Transformer to route information through a subset of its encoder states. We investigate the effects of this sparsification on two machine translation and two summarization tasks. Experiments show that, depending on the task, around 40-70% of source encodings can be pruned without significantly compromising quality. The reduced encoder output length gives L0DROP the potential to improve decoding efficiency: it yields a speedup of up to 1.65× on document summarization and 1.20× on character-based machine translation over the standard Transformer. We analyze the behaviour of L0DROP and observe that it exhibits systematic preferences for pruning certain word types, e.g., function words and punctuation are pruned most. Inspired by these observations, we explore the feasibility of specifying rule-based patterns that mask out encoder outputs based on information such as part-of-speech tags, word frequency and word position.
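The paper's code is linked below. As a rough illustration of the mechanism the abstract describes, the following is a minimal PyTorch sketch of a hard-concrete-style stochastic gate with a closed-form expected-L0 penalty, in the spirit of Louizos et al. (2018), on which L0-based sparsification typically builds. Class and parameter names, and the exact parameterization, are assumptions for illustration only, not the paper's implementation.

```python
import math

import torch
import torch.nn as nn


class L0Gate(nn.Module):
    """Hard-concrete stochastic gate over encoder states (illustrative sketch,
    not the paper's exact L0DROP layer)."""

    def __init__(self, d_model: int, beta: float = 0.5,
                 gamma: float = -0.1, zeta: float = 1.1):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)  # per-state gate logit (log alpha)
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self, enc_out: torch.Tensor):
        # enc_out: (batch, src_len, d_model)
        log_alpha = self.proj(enc_out).squeeze(-1)  # (batch, src_len)

        if self.training:
            # reparameterized sample from the concrete distribution
            u = torch.rand_like(log_alpha).clamp(1e-6, 1.0 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / self.beta)
        else:
            s = torch.sigmoid(log_alpha / self.beta)

        # stretch to (gamma, zeta) and clamp: exact 0s and 1s become possible
        gate = torch.clamp(s * (self.zeta - self.gamma) + self.gamma, 0.0, 1.0)

        # closed-form probability that each gate is non-zero = expected L0 cost
        expected_l0 = torch.sigmoid(
            log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()

        # gated encoder states, the gate values, and the sparsity penalty
        return enc_out * gate.unsqueeze(-1), gate, expected_l0
```

In such a setup, training would add a weighted `expected_l0` term to the generation loss so the model is pushed to close gates, and at inference encoder states whose gate is exactly zero could be dropped before cross-attention, shortening the sequence the decoder attends to.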

Document type Conference contribution
Note With supplementary video
Language English
Published at https://doi.org/10.18653/v1/2021.findings-acl.255
Other links
  • https://github.com/bzhangGo/zero
  • https://www.scopus.com/pages/publications/85118010703