Sparsifying Transformer Models with Differentiable Representation Pooling
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Abstract
We propose a novel method to sparsify attention in the Transformer model by learning to select the most informative token representations, thus leveraging the model's information bottleneck in two ways. A careful analysis shows that the contextualization of encoded representations in our model is significantly more effective than in the original Transformer. We achieve a notable reduction in memory usage due to an improved differentiable top-k operator, making the model suitable for processing long documents, as we demonstrate on a summarization task.
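The abstract does not spell out the differentiable top-k operator itself. A common continuous relaxation of hard top-k selection is iterative softmax with soft masking: run k rounds of softmax over relevance scores, and after each round suppress the (softly) selected item so the next round picks a different one. The sketch below illustrates that generic relaxation in numpy; the function name, the masking scheme, and the temperature parameter are illustrative assumptions, not the operator from this paper.

```python
import numpy as np

def soft_top_k(scores, k, temperature=0.1):
    """Soft relaxation of top-k selection (illustrative, not the paper's operator).

    scores: (n,) array of per-token relevance scores.
    Returns a (k, n) array of selection weights; each row is a softmax
    that concentrates on one of the k highest-scoring entries as
    temperature -> 0, while remaining differentiable in `scores`.
    """
    scores = scores.astype(float).copy()
    rows = []
    for _ in range(k):
        # Soft argmax over the remaining (unsuppressed) scores.
        z = np.exp((scores - scores.max()) / temperature)
        w = z / z.sum()
        rows.append(w)
        # Softly mask the selected item: adding log(1 - w) drives the
        # score of an almost-selected entry toward -inf for later rounds.
        scores = scores + np.log(np.maximum(1.0 - w, 1e-20))
    return np.stack(rows)

# Pooling: the k weight rows select k token representations as
# weighted sums, shrinking the sequence from n tokens to k.
token_reprs = np.random.randn(4, 8)          # (n, d) toy representations
weights = soft_top_k(np.array([0.1, 2.0, 0.5, 1.5]), k=2)
pooled = weights @ token_reprs               # (k, d) pooled representations
```

At low temperature each weight row approaches a one-hot vector on the next-highest-scoring token, recovering hard top-k, while gradients still flow through the softmax, which is what makes the selection trainable end to end.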
