
Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

21 January 2026

Yuval Ran-Milo

Yotam Alexander

Shahar Mendel

Nadav Cohen

OffRL

ReLM

LRM


Main: 10 Pages

7 Figures

Bibliography: 4 Pages

3 Tables

Appendix: 73 Pages

Abstract

Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of "simple examples": instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler examples, the Transformer learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, policy gradient learning becomes infeasible. We corroborate our theoretical results through experiments on synthetic data and with real-world language models on mathematical reasoning tasks, validating that our theoretical findings carry over to practical settings.
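
To make the setting concrete, here is a minimal toy sketch of a chain-traversal task with an outcome-only reward. It is an illustration under stated assumptions, not the paper's construction: the abstract does not specify the exact task, so the function names (`make_chain`, `traverse`, `outcome_reward`, `sample_length`), the dictionary-based graph encoding, and the `p_simple` mixing parameter are all hypothetical. No Transformer or policy gradient training is included.

```python
import random


def make_chain(num_vertices, length, rng):
    """Sample a directed path with `length` edges over distinct vertices.

    Hypothetical encoding: edges map each vertex to its successor.
    Returns (edges, start, end).
    """
    path = rng.sample(range(num_vertices), length + 1)
    edges = {path[i]: path[i + 1] for i in range(length)}
    return edges, path[0], path[-1]


def traverse(edges, start):
    """Iterative solution: follow one edge per step, vertex by vertex.

    The returned trace plays the role of the intermediate reasoning steps.
    """
    trace = [start]
    while trace[-1] in edges:
        trace.append(edges[trace[-1]])
    return trace


def outcome_reward(trace, end):
    """Sparse reward: 1.0 only if the final vertex is correct.

    Intermediate steps in the trace are never scored.
    """
    return 1.0 if trace[-1] == end else 0.0


def sample_length(rng, max_len, p_simple):
    """Assumed training mixture: with probability `p_simple` draw a
    'simple' example (a single step), otherwise a longer chain.
    """
    if rng.random() < p_simple:
        return 1
    return rng.randint(2, max_len)


if __name__ == "__main__":
    rng = random.Random(0)
    length = sample_length(rng, max_len=6, p_simple=0.5)
    edges, start, end = make_chain(20, length, rng)
    trace = traverse(edges, start)
    print(length, trace, outcome_reward(trace, end))
```

In this sketch, `p_simple` stands in for the mass the training distribution places on examples requiring fewer reasoning steps, the quantity the abstract identifies as critical.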
