Establishing a Scale for Kullback–Leibler Divergence in Language Models Across Various Settings
Main: 4 pages · Appendix: 13 pages · Bibliography: 6 pages · 19 figures · 8 tables
Abstract
Log-likelihood vectors place language models, viewed as probability distributions, in a common space, enabling unified comparisons across heterogeneous settings. We extend this framework to training checkpoints and intermediate layers, and establish a consistent scale for KL divergence across pretraining, model size, random seeds, quantization, fine-tuning, and layers. Analysis of Pythia pretraining trajectories further shows that changes in log-likelihood space are far smaller than those in weight space, yielding subdiffusive learning trajectories and early stabilization of language-model behavior despite continued weight drift.
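To make the setup concrete, here is a minimal sketch of computing log-likelihood vectors for two Pythia pretraining checkpoints on a shared probe set. The probe texts, checkpoint revisions, and the final KL proxy are illustrative assumptions, not the paper's exact estimator or corpus; Pythia checkpoints are exposed as Hugging Face revisions such as "step1000".

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def log_likelihood_vector(model_name, texts, revision="main"):
    """Return one summed token log-likelihood (in nats) per probe text."""
    tok = AutoTokenizer.from_pretrained(model_name, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(model_name, revision=revision)
    model.eval()
    vec = []
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # Shift by one: logits at position t predict token t+1.
        logp = torch.log_softmax(logits[:, :-1], dim=-1)
        ll = logp.gather(2, ids[:, 1:].unsqueeze(-1)).sum()
        vec.append(ll.item())
    return torch.tensor(vec)

# Hypothetical probe set; the paper uses a fixed text corpus.
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Language models assign probabilities to text.",
]

# Two checkpoints of the same model live in the same log-likelihood
# space, so their vectors are directly comparable.
early = log_likelihood_vector("EleutherAI/pythia-70m", texts, revision="step1000")
late = log_likelihood_vector("EleutherAI/pythia-70m", texts, revision="step143000")

# If the probe texts were samples from the later checkpoint, the mean gap
# would be the Monte Carlo estimate of KL(late || early); on a fixed
# corpus it is only a rough proxy for that divergence.
print((late - early).mean().item(), "nats/text")
```

The key design point the sketch illustrates is that every model is reduced to a fixed-length vector over the same texts, so models that differ in size, seed, quantization, or training stage can still be compared in one space.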
