Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models

Main:11 Pages
1 Figure
Bibliography:4 Pages
1 Table
Appendix:18 Pages
Abstract

We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a non-linear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, where $M$ is the sample size; the rate depends only on the smoothness $\beta$ of the activation and is crucially independent of the token count, ambient dimension, and rank of the weight matrix. These results highlight a fundamental dimension-free statistical efficiency of attention-style nonlocal models, even when the weight matrix and activation are not separately identifiable, and provide a theoretical understanding of the attention mechanism and its training.
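
To make the stated rate concrete, the following is a short worked calculation, offered as an illustration and not as a result quoted from the paper. The particular values of $\beta$ are chosen only for the example. The contrast with the classical rate for estimating a $\beta$-smooth function of $d$ variables is standard nonparametric background, included here as an assumption about the intended point of comparison.

```latex
% Illustrative values of the exponent 2\beta / (2\beta + 1); the choices of \beta are hypothetical.
\begin{align*}
\beta = 1 &: \quad M^{-\frac{2\beta}{2\beta+1}} = M^{-2/3}, \\
\beta = 2 &: \quad M^{-\frac{2\beta}{2\beta+1}} = M^{-4/5}, \\
\beta \to \infty &: \quad \frac{2\beta}{2\beta+1} \to 1 .
\end{align*}
% Background for contrast (not taken from the abstract): the classical minimax rate
% for a \beta-smooth regression function of d variables is
%   M^{-\frac{2\beta}{2\beta+d}},
% which slows as d grows. The rate above matches the d = 1 case of that formula.
```

Read this way, the exponent behaves like that of a one-dimensional smooth regression problem: a smoother activation gives a faster rate, and none of the token count, ambient dimension, or rank enters the exponent.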
