
Pretraining with Token-Level Adaptive Latent Chain-of-Thought

Boyi Zeng
Yiqin Hao
He Li
Shixiang Song
Feichen Song
Zitong Wang
Siyuan Huang
Yi Xu
ZiWei He
Xinbing Wang
Zhouhan Lin
12 pages (main text), 8 figures, 2 tables, 3 pages of bibliography
Abstract

Scaling large language models by increasing parameters and training data is increasingly constrained by the limited supply of high-quality corpora and by rising communication costs. This work explores an alternative axis: increasing per-token computation without expanding parameters, by internalizing latent Chain-of-Thought (CoT) into pretraining. We propose Pretraining with Token-Level Adaptive Latent CoT (henceforth adaptive latent CoT), where the model generates a variable-length latent CoT trajectory before emitting each token, allocating longer trajectories to difficult tokens and shorter (or even zero-length) trajectories to easy ones. Importantly, this behavior emerges naturally from one-stage pretraining on general text, and token-wise adaptive halting reduces computation in both training and inference. Experiments with Llama architectures show that adaptive latent CoT consistently improves language modeling perplexity and accuracy across a broad range of downstream tasks, even with fewer training FLOPs than prior recurrent baselines.
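The abstract's token-wise adaptive halting can be illustrated with a toy halting loop in the spirit of ACT-style adaptive computation: each token takes latent refinement steps until its cumulative halting probability crosses a threshold or a step cap is hit. The sigmoid halting head over a scalar "difficulty" score below is an illustrative stand-in, not the paper's actual module:

```python
import math

def adaptive_latent_steps(difficulty_scores, threshold=0.99, max_steps=8):
    """For each token, run latent steps until the cumulative halting
    probability exceeds `threshold`, capped at `max_steps`.

    The per-step halting probability is a toy sigmoid of a per-token
    difficulty score (a hypothetical stand-in for a learned halting head).
    """
    steps_per_token = []
    for d in difficulty_scores:
        # Easy tokens (low difficulty) get a high halting probability.
        p_halt = 1.0 / (1.0 + math.exp(d))  # sigmoid(-d)
        cumulative, steps = 0.0, 0
        while cumulative < threshold and steps < max_steps:
            # The remaining probability mass halts at rate p_halt each step.
            cumulative += p_halt * (1.0 - cumulative)
            steps += 1
        steps_per_token.append(steps)
    return steps_per_token

# An easy token halts after a few latent steps; a hard token runs
# until the cap, mirroring the variable-length trajectories described above.
print(adaptive_latent_steps([-2.0, 0.0, 2.0]))
```

This only sketches the halting schedule; in the paper the latent CoT trajectory itself is produced by the model, and the halting decision is learned during one-stage pretraining rather than derived from a fixed score.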
