
Learning In-context n-grams with Transformers: Sub-n-grams Are Near-stationary Points

Main: 9 pages · 7 figures · Bibliography: 2 pages · Appendix: 29 pages
Abstract

Motivated by empirical observations of prolonged plateaus and stage-wise progression during training, we investigate the loss landscape of transformer models trained on in-context next-token prediction tasks. In particular, we focus on learning in-context $n$-gram language models under cross-entropy loss, and establish a sufficient condition for parameter configurations to be stationary points. We then construct a set of parameter configurations for a simplified transformer model that represent $k$-gram estimators (for $k \leq n$), and show that the gradient of the population loss at these solutions vanishes in the limit of infinite sequence length and parameter norm. This reveals a key property of the loss landscape: sub-$n$-grams are near-stationary points of the population cross-entropy loss, offering theoretical insight into widely observed phenomena such as stage-wise learning dynamics and emergent phase transitions. These insights are further supported by numerical experiments that illustrate the learning dynamics of $n$-grams, characterized by discrete transitions between near-stationary solutions.
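
As a rough illustration of the plateau structure described above, the sketch below (not from the paper; the random source distribution, the add-one smoothing, and all parameter values are assumptions chosen for illustration) measures the average cross-entropy attained by in-context $k$-gram estimators for $k \leq n$ on sequences drawn from a random $n$-gram source. The successive loss levels for increasing $k$ mirror the near-stationary sub-$n$-gram solutions between which training is observed to transition.

```python
# Illustrative sketch only: compare the average next-token cross-entropy of
# in-context k-gram estimators (k <= n) on sequences from a random n-gram
# source. Higher-order estimators reach lower loss levels, giving a rough
# picture of the plateau hierarchy associated with sub-n-gram solutions.
import numpy as np

rng = np.random.default_rng(0)
V, n, T = 4, 3, 4096          # vocabulary size, source order, sequence length (assumed values)

# Random n-gram source: one conditional distribution p(x_t | previous n-1 tokens) per context.
source = rng.dirichlet(np.ones(V), size=V ** (n - 1))

def sample_sequence(T):
    """Draw a length-T sequence from the random n-gram source."""
    x = list(rng.integers(0, V, size=n - 1))
    for _ in range(T):
        ctx = sum(c * V ** i for i, c in enumerate(x[-(n - 1):]))
        x.append(rng.choice(V, p=source[ctx]))
    return np.array(x)

def kgram_cross_entropy(x, k):
    """Average cross-entropy of an in-context k-gram estimator that predicts
    each next token from add-one-smoothed counts over the prefix seen so far."""
    counts = np.ones((V ** (k - 1), V))   # Laplace-smoothed counts per (k-1)-token context
    loss, m = 0.0, 0
    for t in range(n - 1, len(x) - 1):
        ctx = sum(c * V ** i for i, c in enumerate(x[t - k + 2:t + 1]))
        probs = counts[ctx] / counts[ctx].sum()
        loss += -np.log(probs[x[t + 1]])
        counts[ctx, x[t + 1]] += 1
        m += 1
    return loss / m

x = sample_sequence(T)
for k in range(1, n + 1):
    print(f"k={k}: cross-entropy {kgram_cross_entropy(x, k):.3f}")
```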
