
On the Capacity of Self-Attention

Main: 13 pages · 9 figures · 2 tables · Bibliography: 7 pages · Appendix: 19 pages
Abstract

While self-attention is known to learn relations among tokens, we lack a formal understanding of its capacity: how many distinct relations can a single layer reliably recover for a given budget? To formalize this, we introduce Relational Graph Recognition (RGR), where the key-query channel represents a graph on $m$ items with $m'$ directed edges and, given a context of items, must recover the neighbors of each item. We measure resources by the total key dimension $D_K = h\,d_k$. Within this framework, we analytically derive a capacity scaling law and validate it empirically. We show that $D_K = \Theta(m' \log m' / d_{\text{model}})$ is both necessary (information-theoretic lower bound) and sufficient (explicit construction) for recovering $m'$ relations across a broad class of graphs. This scaling law directly leads to a new, capacity-based rationale for multi-head attention that applies even when each item attends to only a single target. When embeddings are uncompressed ($m = d_{\text{model}}$) and the graph is a permutation, a single head suffices. However, compression ($m > d_{\text{model}}$) forces relations into overlapping subspaces, creating interference that a single large head cannot disentangle. Our analysis shows that allocating a fixed $D_K$ across many small heads mitigates this interference, increasing the number of recoverable relations. Controlled single-layer experiments mirror the theory, revealing a sharp performance threshold that matches the predicted capacity scaling and confirms the benefit of distributing $D_K$ across multiple heads. Altogether, these results provide a concrete scaling law for self-attention capacity and a principled design rule for allocating key-query budget across heads.
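As a rough illustration of the stated scaling law, the sketch below evaluates $D_K \approx c\,m' \log m' / d_{\text{model}}$ for a few edge counts and shows how a fixed key budget splits across heads. The constant `c` and the helper names (`predicted_key_budget`, `split_budget`) are illustrative assumptions, not values or code from the paper.

```python
import math

def predicted_key_budget(m_prime: int, d_model: int, c: float = 1.0) -> float:
    """Predicted total key dimension D_K = h * d_k needed to recover m' relations
    under the stated scaling D_K = Theta(m' log m' / d_model); c is an unknown constant."""
    return c * m_prime * math.log(m_prime) / d_model

def split_budget(total_dk: float, num_heads: int) -> int:
    """Per-head key dimension d_k when a fixed budget D_K is spread over num_heads heads."""
    return max(1, int(total_dk // num_heads))

if __name__ == "__main__":
    d_model = 256
    for m_prime in (256, 1024, 4096):  # number of directed edges (relations) to store
        dk_total = predicted_key_budget(m_prime, d_model)
        print(f"m'={m_prime:5d}  predicted D_K ~ {dk_total:7.1f}  "
              f"(e.g. {split_budget(dk_total, 8)} dims/head over 8 heads)")
```

The per-head split reflects the paper's design rule only qualitatively: under compression ($m > d_{\text{model}}$), the claim is that many small heads recover more relations than one large head of the same total $D_K$.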
