Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test

Main: 9 pages · Appendix: 5 pages · Bibliography: 3 pages · 9 figures · 5 tables
Abstract

Grokking, i.e., test performance that keeps improving long after the training loss has converged, has recently been observed in neural network training, making the mechanisms behind generalization and other emerging capabilities such as reasoning mysterious. While prior studies usually train small models on a few toy or highly specific tasks for thousands of epochs, we conduct the first study of grokking on checkpoints from the one-pass pretraining of a 7B large language model (LLM), namely OLMoE. We compute the training loss and evaluate generalization on diverse benchmark tasks, including math reasoning, code generation, and commonsense/domain-specific knowledge retrieval.
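Below is a minimal sketch, not the authors' code, of the monitoring setup the abstract describes: loading intermediate OLMoE pretraining checkpoints, computing training loss on held-in pretraining text, and leaving a hook for downstream benchmark evaluation. The checkpoint revision tags and the sample text are illustrative assumptions; only the Hugging Face `transformers` calls shown are standard API.

```python
# Sketch: track training loss across pretraining checkpoints of OLMoE.
# Revision tags below are hypothetical placeholders; OLMoE publishes
# intermediate checkpoints as revisions of allenai/OLMoE-1B-7B-0924.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "allenai/OLMoE-1B-7B-0924"
REVISIONS = ["step100000", "step200000"]  # hypothetical checkpoint tags

tokenizer = AutoTokenizer.from_pretrained(MODEL)

def training_loss(model, texts):
    """Mean next-token cross-entropy over a batch of pretraining text."""
    losses = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
        with torch.no_grad():
            # Causal LMs shift labels internally; .loss is per-token CE.
            losses.append(model(ids, labels=ids).loss.item())
    return sum(losses) / len(losses)

for rev in REVISIONS:
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=rev)
    model.eval()
    loss = training_loss(model, ["Example pretraining document ..."])
    print(f"{rev}: train loss {loss:.3f}")
    # Generalization on math, code, and knowledge benchmarks would be
    # evaluated here, e.g. with an off-the-shelf evaluation harness.
```

Plotting the resulting training-loss curve against benchmark scores per checkpoint is what would reveal a grokking-style gap between loss convergence and delayed generalization.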
