
Minimax Rates for Learning Pairwise Interactions in Attention-Style Models

Main: 10 pages
Appendix: 19 pages
Bibliography: 6 pages
1 figure
1 table
Abstract

We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a nonlinear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, where $M$ is the sample size and $\beta$ is the Hölder smoothness of the activation function. Importantly, this rate is independent of the embedding dimension $d$, the number of tokens $N$, and the rank $r$ of the weight matrix, provided that $rd \le (M/\log M)^{\frac{1}{2\beta+1}}$. These results highlight a fundamental statistical efficiency of attention-style models, even when the weight matrix and activation function are not separately identifiable, and provide a theoretical understanding of attention mechanisms as well as guidance for training.
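To make the stated rate concrete, here is a minimal numerical sketch of the two quantities in the abstract: the minimax rate $M^{-\frac{2\beta}{2\beta+1}}$ and the rank condition $rd \le (M/\log M)^{\frac{1}{2\beta+1}}$. The function names are illustrative (not from the paper), and the code simply evaluates the formulas for example values of $M$, $\beta$, $r$, and $d$:

```python
import math

def minimax_rate(M: float, beta: float) -> float:
    """Evaluate the minimax rate M^{-2*beta/(2*beta+1)} from the abstract.

    M    -- sample size
    beta -- Hölder smoothness of the activation function
    """
    return M ** (-2 * beta / (2 * beta + 1))

def rank_condition_holds(r: int, d: int, M: float, beta: float) -> bool:
    """Check the dimension-free regime: r*d <= (M / log M)^{1/(2*beta+1)}.

    r -- rank of the weight matrix
    d -- embedding dimension
    """
    return r * d <= (M / math.log(M)) ** (1 / (2 * beta + 1))

# Example: a Lipschitz activation (beta = 1) with one million samples
# gives the rate M^{-2/3} = 1e-4.
print(minimax_rate(1e6, beta=1.0))
print(rank_condition_holds(r=2, d=16, M=1e6, beta=1.0))
```

Note that the rate depends only on $M$ and $\beta$; the rank condition is where $r$ and $d$ enter, which is why the rate itself is dimension-free in that regime.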
