
On the Capacity of Self-Attention

Main: 25 pages · 14 figures · 2 tables · Bibliography: 8 pages · Appendix: 24 pages
Abstract

While self-attention is known to learn relations among tokens, we lack a formal understanding of its capacity: how many distinct relations can a single layer reliably recover for a given budget?
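For reference, the object under study is a single self-attention layer. A minimal single-head sketch in NumPy follows; the function name, dimensions, and random weights are illustrative only, not the paper's construction:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) token-pair scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax: attention weights
    return A @ V                                  # each output row mixes values by its weights

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```

The attention matrix `A` encodes pairwise token relations; the paper's capacity question asks how many distinct such relations one layer can reliably represent.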
