ResearchTrend.AI


arXiv:2403.08081
Mechanics of Next Token Prediction with Self-Attention

12 March 2024
Yingcong Li
Yixiao Huang
M. E. Ildiz
A. S. Rawat
Samet Oymak
Abstract

Transformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying this success is the self-attention mechanism. In this work, we ask: What does a single self-attention layer learn from next-token prediction? We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps: (1) Hard retrieval: Given an input sequence, self-attention precisely selects the high-priority input tokens associated with the last input token. (2) Soft composition: It then creates a convex combination of the high-priority tokens from which the next token can be sampled. Under suitable conditions, we rigorously characterize these mechanics through a directed graph over tokens extracted from the training data. We prove that gradient descent implicitly discovers the strongly-connected components (SCCs) of this graph, and that self-attention learns to retrieve the tokens belonging to the highest-priority SCC available in the context window. Our theory relies on decomposing the model weights into a directional component and a finite component, which correspond to the hard-retrieval and soft-composition steps, respectively. This also formalizes a related implicit-bias formula conjectured in [Tarzanagh et al. 2023]. We hope that these findings shed light on how self-attention processes sequential data and pave the path toward demystifying more complex architectures.
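The retrieval-then-composition mechanics summarized in the abstract can be sketched numerically. The toy Python snippet below is an illustration under assumed details, not the paper's construction: the token graph, the priority scores (standing in for the learned directional component of the weights), and the context are all hypothetical. It builds a directed graph over tokens, finds its strongly-connected components with Tarjan's algorithm, performs "hard retrieval" by keeping the context tokens in the highest-priority SCC present, and performs "soft composition" by forming a convex (softmax) combination over them.

```python
import math
from collections import defaultdict

def sccs(nodes, edges):
    """Tarjan's algorithm: return the strongly-connected components
    (each a set of nodes) of the directed graph (nodes, edges)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    index, low, on_stack = {}, {}, set()
    stack, out, counter = [], [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in adj[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of an SCC
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            out.append(comp)

    for v in nodes:
        if v not in index:
            strongconnect(v)
    return out

def softmax(xs):
    """Convex combination weights: nonnegative and summing to one."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical token graph from training bigrams: {a, b} form one SCC.
tokens = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "a"), ("c", "d")]
components = sccs(tokens, edges)
# Hypothetical priority scores for each token.
priority = {"a": 2.0, "b": 2.0, "c": 0.5, "d": 0.5}

context = ["c", "a", "b"]
# (1) Hard retrieval: keep context tokens from the highest-priority
# SCC that is available in the context window.
best = max(
    components,
    key=lambda comp: max(priority[t] for t in comp if t in context)
    if any(t in context for t in comp) else float("-inf"),
)
retrieved = [t for t in context if t in best]
# (2) Soft composition: convex combination over the retrieved tokens.
weights = softmax([priority[t] for t in retrieved])
```

Here `retrieved` contains only `a` and `b` (the highest-priority SCC present in the context), and `weights` is a probability vector over them from which a next token could be sampled.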
