Sparsifying Transformer Models with Differentiable Representation Pooling
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Abstract
We propose a novel method to sparsify attention in the Transformer model by learning to select the most informative token representations, thus leveraging the model's information bottleneck in two ways. A careful analysis shows that the contextualization of encoded representations in our model is significantly more effective than in the original Transformer. We achieve a notable reduction in memory usage due to an improved differentiable top-k operator, making the model suitable for processing long documents, as we demonstrate on a summarization task.
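The abstract does not spell out the differentiable top-k operator itself. A common continuous relaxation of hard top-k selection is iterative softmax with soft masking: run k rounds of softmax over relevance scores, and after each round suppress the (softly) selected item so the next round picks a different one. The sketch below illustrates that generic relaxation in numpy; the function name, the masking scheme, and the temperature parameter are illustrative assumptions, not the operator from this paper.

```python
import numpy as np

def soft_top_k(scores, k, temperature=0.1):
    """Soft relaxation of top-k selection (illustrative, not the paper's operator).

    scores: (n,) array of per-token relevance scores.
    Returns a (k, n) array of selection weights; each row is a softmax
    that concentrates on one of the k highest-scoring entries as
    temperature -> 0, while remaining differentiable in `scores`.
    """
    scores = scores.astype(float).copy()
    rows = []
    for _ in range(k):
        # Soft argmax over the remaining (unsuppressed) scores.
        z = np.exp((scores - scores.max()) / temperature)
        w = z / z.sum()
        rows.append(w)
        # Softly mask the selected item: adding log(1 - w) drives the
        # score of an almost-selected entry toward -inf for later rounds.
        scores = scores + np.log(np.maximum(1.0 - w, 1e-20))
    return np.stack(rows)

# Pooling: the k weight rows select k token representations as
# weighted sums, shrinking the sequence from n tokens to k.
token_reprs = np.random.randn(4, 8)          # (n, d) toy representations
weights = soft_top_k(np.array([0.1, 2.0, 0.5, 1.5]), k=2)
pooled = weights @ token_reprs               # (k, d) pooled representations
```

At low temperature each weight row approaches a one-hot vector on the next-highest-scoring token, recovering hard top-k, while gradients still flow through the softmax, which is what makes the selection trainable end to end.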
