Causal Transformers are trained to predict the next token for a given context. While it is widely accepted that self-attention is crucial for encoding the causal structure of sequences, the precise underlying mechanism behind this in-context autoregressive learning ability remains unclear. In this paper, we take a step towards understanding this phenomenon by studying the approximation ability of Transformers for next-token prediction. Specifically, we explore the capacity of causal Transformers to predict the next token $x_{t+1}$ given an autoregressive sequence $(x_1, \dots, x_t)$ as a prompt, where $x_{t+1} = f(x_t)$, and $f$ is a context-dependent function that varies with each sequence. On the theoretical side, we focus on specific instances, namely when $f$ is linear or when $(x_t)_{t \geq 1}$ is periodic. We explicitly construct a Transformer (with linear, exponential, or softmax attention) that learns the mapping $f$ in-context through a causal kernel descent method. The causal kernel descent method we propose provably estimates $x_{t+1}$ based solely on past and current observations $(x_1, \dots, x_t)$, with connections to the Kaczmarz algorithm in Hilbert spaces. We present experimental results that validate our theoretical findings and suggest their applicability to more general mappings $f$.
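As a rough illustration of the linear case and the Kaczmarz connection, the following numpy sketch (our own illustrative assumption, not the paper's Transformer construction or exact kernel descent) estimates a context-dependent linear map $f(x) = Wx$ from a single prompt using a Kaczmarz-style iteration over past observations, then predicts the next token from the estimate; the dimensions, the orthogonal choice of $W$, and the single-pass schedule are arbitrary choices made for the example.

    # Minimal sketch (assumed setup, not the paper's construction): estimate a
    # context-dependent linear map f(x) = W x from a single prompt (x_1, ..., x_t)
    # with a Kaczmarz-style iteration, then predict x_{t+1} = W_hat x_t
    # using only past and current observations.
    import numpy as np

    rng = np.random.default_rng(0)
    d, T = 8, 64

    # Context-dependent ground truth; an orthogonal W keeps the sequence bounded.
    W, _ = np.linalg.qr(rng.standard_normal((d, d)))

    # Autoregressive prompt: x_{i+1} = W x_i.
    x = np.zeros((T, d))
    x[0] = rng.standard_normal(d)
    for i in range(T - 1):
        x[i + 1] = W @ x[i]

    # Kaczmarz-style in-context estimation: each observed pair (x_i, x_{i+1})
    # defines the constraint W_hat x_i = x_{i+1}; project the current estimate
    # onto that constraint, processing the prompt causally (left to right).
    W_hat = np.zeros((d, d))
    for i in range(T - 1):
        xi, yi = x[i], x[i + 1]
        W_hat += np.outer(yi - W_hat @ xi, xi) / (xi @ xi)

    # Next-token prediction from the estimated map.
    pred = W_hat @ x[T - 1]
    true_next = W @ x[T - 1]
    print("prediction error:", np.linalg.norm(pred - true_next))

The update rule is the standard (row-wise) Kaczmarz projection for the linear system $W x_i = x_{i+1}$; in the paper this role is played by the proposed causal kernel descent implemented through attention.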
Citation:

@article{sander2025_2410.03011,
  title={Towards Understanding the Universality of Transformers for Next-Token Prediction},
  author={Michael E. Sander and Gabriel Peyré},
  journal={arXiv preprint arXiv:2410.03011},
  year={2025}
}