OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
Jiacheng Liu
Taylor Blanton
Yanai Elazar
Sewon Min
YenSung Chen
Arnavi Chheda-Kothary
Huy Tran
Byron Bischoff
Eric Stuart Marsh
Michael Schmitz
Cassidy Trier
Aaron Sarnat
Jenna James
Jon Borchardt
Bailey Kuehl
Evie (Yu-Yen) Cheng
Karen Farley
Sruthi Sreeram
Taira Anderson
David Albright
Carissa Schoenick
Luca Soldaini
Dirk Groeneveld
Rock Yuren Pang
Pang Wei Koh
Noah A. Smith
Sophie Lebrecht
Yejin Choi
Hannaneh Hajishirzi
Ali Farhadi
Jesse Dodge

Abstract
We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.
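The core mechanism described above, finding verbatim overlaps between segments of a model's output and documents in its training corpus, can be illustrated with a small, self-contained sketch. This is a conceptual brute-force stand-in over a toy in-memory corpus, not the OLMoTrace implementation, which relies on an extended infini-gram suffix-array index to search trillions of tokens in seconds; the names find_verbatim_spans and SpanMatch, the min_len threshold, and the example corpus are illustrative assumptions.

# Conceptual sketch only: find long verbatim spans of a model's output that
# also appear in a small corpus. The real system queries an infini-gram
# suffix-array index over trillions of tokens; a linear scan over a toy
# corpus stands in for that index here.

from dataclasses import dataclass

@dataclass
class SpanMatch:
    start: int   # start index (in tokens) within the model output
    end: int     # end index, exclusive
    doc_id: int  # which corpus document contained the span
    text: str    # the matched verbatim text

def find_verbatim_spans(output_tokens, corpus_docs, min_len=6):
    """Return maximal token spans of `output_tokens` (length >= min_len)
    that occur verbatim in at least one corpus document."""
    matches = []
    i, n = 0, len(output_tokens)
    while i < n:
        best = None
        # Grow the span greedily while it still occurs in some document.
        # (Whitespace-joined substring matching is an approximation of
        # token-level matching, which is what a real index would use.)
        for j in range(i + min_len, n + 1):
            span = " ".join(output_tokens[i:j])
            hit = next((d for d, doc in enumerate(corpus_docs) if span in doc), None)
            if hit is None:
                break
            best = SpanMatch(i, j, hit, span)
        if best is not None:
            matches.append(best)
            i = best.end  # skip past the matched span
        else:
            i += 1
    return matches

if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog near the river bank",
        "language models are trained on large text corpora scraped from the web",
    ]
    output = "we found that the quick brown fox jumps over the lazy dog in our tests".split()
    for m in find_verbatim_spans(output, corpus, min_len=5):
        print(f"tokens [{m.start}:{m.end}] matched doc {m.doc_id}: '{m.text}'")

In the production setting, each matched span would additionally be linked back to the specific training documents it came from so users can inspect them, which is what the interface surfaces.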
@article{liu2025_2504.07096,
  title={OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens},
  author={Jiacheng Liu and Taylor Blanton and Yanai Elazar and Sewon Min and YenSung Chen and Arnavi Chheda-Kothary and Huy Tran and Byron Bischoff and Eric Marsh and Michael Schmitz and Cassidy Trier and Aaron Sarnat and Jenna James and Jon Borchardt and Bailey Kuehl and Evie Cheng and Karen Farley and Sruthi Sreeram and Taira Anderson and David Albright and Carissa Schoenick and Luca Soldaini and Dirk Groeneveld and Rock Yuren Pang and Pang Wei Koh and Noah A. Smith and Sophie Lebrecht and Yejin Choi and Hannaneh Hajishirzi and Ali Farhadi and Jesse Dodge},
  journal={arXiv preprint arXiv:2504.07096},
  year={2025}
}