On the Capacity of Self-Attention
Main: 25 pages · Bibliography: 8 pages · Appendix: 24 pages · 14 figures · 2 tables
Abstract
While self-attention is known to learn relations among tokens, we lack a formal understanding of its capacity: how many distinct relations can a single layer reliably recover for a given budget?
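To fix notation for the question above, a minimal sketch of a single self-attention layer (standard scaled dot-product attention, one head; the dimensions and weight initialization here are illustrative assumptions, not the paper's experimental setup):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (T, T) pairwise token-token relation scores
    # numerically stable softmax over keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Illustrative sizes: T tokens, model width d_model, head width d_head.
rng = np.random.default_rng(0)
T, d_model, d_head = 4, 8, 8
X = rng.standard_normal((T, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

The attention matrix `weights` is where token-to-token relations are expressed; the capacity question concerns how many distinct such relation patterns one layer can realize under a fixed budget.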
