
Minimax Rates for Learning Pairwise Interactions in Attention-Style Models

Main: 10 pages
Appendix: 19 pages
Bibliography: 6 pages
1 figure
1 table
Abstract

We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a nonlinear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, where $M$ is the sample size and $\beta$ is the Hölder smoothness of the activation function. Importantly, this rate is independent of the embedding dimension $d$, the number of tokens $N$, and the rank $r$ of the weight matrix, provided that $rd \le (M/\log M)^{\frac{1}{2\beta+1}}$. These results highlight a fundamental statistical efficiency of attention-style models, even when the weight matrix and activation function are not separately identifiable, and provide a theoretical understanding of attention mechanisms as well as guidance for training.
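To make the stated rate concrete, here is a minimal numerical sketch of the two quantities in the abstract: the minimax rate $M^{-\frac{2\beta}{2\beta+1}}$ and the rank condition $rd \le (M/\log M)^{\frac{1}{2\beta+1}}$. The function names are illustrative (not from the paper), and the code simply evaluates the formulas for example values of $M$, $\beta$, $r$, and $d$:

```python
import math

def minimax_rate(M: float, beta: float) -> float:
    """Evaluate the minimax rate M^{-2*beta/(2*beta+1)} from the abstract.

    M    -- sample size
    beta -- Hölder smoothness of the activation function
    """
    return M ** (-2 * beta / (2 * beta + 1))

def rank_condition_holds(r: int, d: int, M: float, beta: float) -> bool:
    """Check the dimension-free regime: r*d <= (M / log M)^{1/(2*beta+1)}.

    r -- rank of the weight matrix
    d -- embedding dimension
    """
    return r * d <= (M / math.log(M)) ** (1 / (2 * beta + 1))

# Example: a Lipschitz activation (beta = 1) with one million samples
# gives the rate M^{-2/3} = 1e-4.
print(minimax_rate(1e6, beta=1.0))
print(rank_condition_holds(r=2, d=16, M=1e6, beta=1.0))
```

Note that the rate depends only on $M$ and $\beta$; the rank condition is where $r$ and $d$ enter, which is why the rate itself is dimension-free in that regime.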
