ResearchTrend.AI

Towards Understanding the Universality of Transformers for Next-Token Prediction

3 October 2024
Michael E. Sander
Gabriel Peyré
Abstract

Causal Transformers are trained to predict the next token for a given context. While it is widely accepted that self-attention is crucial for encoding the causal structure of sequences, the precise underlying mechanism behind this in-context autoregressive learning ability remains unclear. In this paper, we take a step towards understanding this phenomenon by studying the approximation ability of Transformers for next-token prediction. Specifically, we explore the capacity of causal Transformers to predict the next token $x_{t+1}$ given an autoregressive sequence $(x_1, \dots, x_t)$ as a prompt, where $x_{t+1} = f(x_t)$ and $f$ is a context-dependent function that varies with each sequence. On the theoretical side, we focus on specific instances, namely when $f$ is linear or when $(x_t)_{t \geq 1}$ is periodic. We explicitly construct a Transformer (with linear, exponential, or softmax attention) that learns the mapping $f$ in-context through a causal kernel descent method. The causal kernel descent method we propose provably estimates $x_{t+1}$ based solely on past and current observations $(x_1, \dots, x_t)$, with connections to the Kaczmarz algorithm in Hilbert spaces. We present experimental results that validate our theoretical findings and suggest their applicability to more general mappings $f$.
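The linear case admits a small numerical sketch. This is not the paper's Transformer construction: it only illustrates the classical Kaczmarz method that the abstract connects to causal kernel descent, applied to pairs $(x_t, x_{t+1})$ with $x_{t+1} = W x_t$ for an unknown matrix $W$. The function name, data sizes, and sweep count below are illustrative choices, not from the paper.

```python
import numpy as np

def kaczmarz_next_token(xs, n_sweeps=2000):
    """Estimate the next token of a sequence satisfying x_{t+1} = W x_t
    (W unknown) via Kaczmarz-style projections onto the observed pairs."""
    d = xs.shape[1]
    W = np.zeros((d, d))
    for _ in range(n_sweeps):
        for t in range(len(xs) - 1):
            x, y = xs[t], xs[t + 1]
            # Orthogonally project W onto the constraint set {W : W x = y};
            # after this update, W @ x == y holds exactly.
            W += np.outer(y - W @ x, x) / (x @ x)
    return W @ xs[-1]

# Toy data: an orthogonal ground-truth map keeps the iterates well scaled.
rng = np.random.default_rng(0)
d = 4
W_true, _ = np.linalg.qr(rng.standard_normal((d, d)))
xs = [rng.standard_normal(d)]
for _ in range(8):
    xs.append(W_true @ xs[-1])
xs = np.array(xs)

# Fit on all tokens but the last, then predict the held-out final token.
pred = kaczmarz_next_token(xs[:-1])
err = np.linalg.norm(pred - xs[-1])
print(err)  # residual shrinks with the number of sweeps
```

Each update is the orthogonal projection of the current estimate onto the affine set of matrices consistent with one observed transition, mirroring how the in-context estimator uses only past and current observations.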

@article{sander2025_2410.03011,
  title={Towards Understanding the Universality of Transformers for Next-Token Prediction},
  author={Michael E. Sander and Gabriel Peyré},
  journal={arXiv preprint arXiv:2410.03011},
  year={2025}
}