
A Capacity-Based Rationale for Multi-Head Attention

Main: 25 pages · 14 figures · 2 tables · Bibliography: 8 pages · Appendix: 24 pages
Abstract

We study the capacity of the self-attention key-query channel: for a fixed budget, how many distinct token-token relations can a single layer reliably encode? We introduce Relational Graph Recognition, where the key-query channel encodes a directed graph and, given a context (a subset of the vertices), must recover the neighbors of each vertex in the context. We measure resources by the total key dimension $D_K = h\,d_k$. In a tractable multi-head model, we prove matching information-theoretic lower bounds and upper bounds via explicit constructions, showing that recovering a graph with $m'$ relations in $d_{\text{model}}$-dimensional embeddings requires $D_K$ to grow essentially as $m'/d_{\text{model}}$ up to logarithmic factors, and we obtain corresponding guarantees for scaled-softmax attention. This analysis yields a new, capacity-based rationale for multi-head attention: even for permutation graphs, where each query attends to a single target, splitting a fixed $D_K$ budget into multiple heads increases capacity by reducing interference from embedding superposition. Controlled experiments mirror the theory, revealing sharp phase transitions at the predicted capacity, and the multi-head advantage persists when adding softmax normalization, value routing, and a full Transformer block trained with frozen GPT-2 embeddings.
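As a rough illustration of the stated scaling (a sketch, not from the paper: the function name and the example numbers are assumptions for demonstration), the bound says the total key dimension $D_K = h\,d_k$ must grow on the order of $m'/d_{\text{model}}$, up to logarithmic factors:

```python
def min_total_key_dim(m_prime: int, d_model: int) -> float:
    """Order-of-magnitude total key dimension D_K = h * d_k suggested by
    the capacity bound: D_K must grow roughly as m' / d_model (ignoring
    logarithmic factors and constants)."""
    return m_prime / d_model

# Hypothetical example: encoding 100,000 token-token relations in
# 768-dimensional embeddings (the GPT-2 width mentioned in the abstract)
# calls for D_K on the order of ~130; a standard h=12, d_k=64 split
# (D_K = 768) sits well above that order.
print(min_total_key_dim(100_000, 768))
```

The same budget $D_K$ can then be split as $h \cdot d_k$ across heads, which is where the abstract's interference argument applies.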
