Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models

Main: 11 pages · Bibliography: 4 pages · Appendix: 18 pages · 1 figure · 1 table
Abstract
We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a non-linear activation function. We prove that the minimax rate, expressed in terms of the sample size $n$, depends only on the smoothness of the activation and is crucially independent of the token count, the ambient dimension, and the rank of the weight matrix. These results highlight a fundamental, dimension-free statistical efficiency of attention-style nonlocal models, which holds even when the weight matrix and the activation are not separately identifiable, and they provide a theoretical understanding of the attention mechanism and its training.
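As a concrete illustration (this specific parameterization is an assumption for exposition, not the paper's definition), a single-layer attention-style model with pairwise interactions can take the form

$$ f_{W,\sigma}(x_1,\dots,x_T) \;=\; \frac{1}{T^{2}} \sum_{i=1}^{T} \sum_{j=1}^{T} \sigma\!\left(x_i^{\top} W x_j\right), \qquad x_i \in \mathbb{R}^{d},\; W \in \mathbb{R}^{d\times d},\; \sigma:\mathbb{R}\to\mathbb{R}, $$

where $T$ is the token count, $d$ the ambient dimension, $W$ the (possibly low-rank) weight matrix, and $\sigma$ the activation whose smoothness governs the rate. In this sketch the non-identifiability mentioned in the abstract is easy to see: rescaling $W \mapsto cW$ and $\sigma \mapsto \sigma(\cdot/c)$ for any $c > 0$ leaves $f_{W,\sigma}$ unchanged.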
