Minimax Rates for Learning Pairwise Interactions in Attention-Style Models
Main: 10 pages · Appendix: 19 pages · Bibliography: 6 pages · 1 figure · 1 table
Abstract
We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a nonlinear activation function. We prove that the minimax rate is $n^{-\frac{2\beta}{2\beta+1}}$, where $n$ is the sample size and $\beta$ is the Hölder smoothness of the activation function. Importantly, this rate is independent of the embedding dimension $d$, the number of tokens, and the rank of the weight matrix, provided that . These results highlight a fundamental statistical efficiency of attention-style models, even when the weight matrix and the activation function are not separately identifiable, and they provide both a theoretical understanding of attention mechanisms and guidance for training.
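To make the model class concrete, here is a minimal NumPy sketch of a single-layer attention-style pairwise interaction map. It assumes the interaction between tokens $x_i$ and $x_j$ takes the form $\sigma(x_i^\top W x_j)$ for a weight matrix $W$ and activation $\sigma$; the function name, the choice of `tanh` as the activation, and the rank-2 construction of $W$ are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def pairwise_interactions(X, W, activation=np.tanh):
    """Attention-style pairwise interaction scores (illustrative sketch).

    X : (N, d) array of N token embeddings of dimension d.
    W : (d, d) weight matrix, possibly low-rank.
    Returns an (N, N) array whose (i, j) entry is activation(x_i^T W x_j).
    """
    return activation(X @ W @ X.T)

# Example: 5 tokens, embedding dimension 3, a rank-2 weight matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
U = rng.standard_normal((3, 2))
W = U @ U.T            # rank(W) <= 2
S = pairwise_interactions(X, W)
print(S.shape)         # (5, 5)
```

Note that only the composite map $(x_i, x_j) \mapsto \sigma(x_i^\top W x_j)$ is observed, which is why $W$ and $\sigma$ need not be separately identifiable, as the abstract emphasizes.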
