
Establishing a Scale for Kullback–Leibler Divergence in Language Models Across Various Settings

Main: 4 pages · Appendix: 13 pages · Bibliography: 6 pages · 19 figures · 8 tables
Abstract

Log-likelihood vectors define a common space for comparing language models as probability distributions, enabling unified comparisons across heterogeneous settings. We extend this framework to training checkpoints and intermediate layers, and establish a consistent scale for KL divergence across pretraining, model size, random seeds, quantization, fine-tuning, and layers. Analysis of Pythia pretraining trajectories further shows that changes in log-likelihood space are much smaller than in weight space, resulting in subdiffusive learning trajectories and early stabilization of language-model behavior despite weight drift.
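The comparison framework described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each model is summarized by a vector of log-likelihoods over the same fixed text set, and that each vector is softmax-normalized into a probability distribution over those texts before computing KL divergence. The function name and normalization choice are illustrative assumptions.

```python
import numpy as np

def kl_from_loglik(logp, logq):
    """Estimate KL(P || Q) between two language models, each summarized
    by a log-likelihood vector over a shared, fixed set of texts.
    (Illustrative sketch: vectors are softmax-normalized into
    distributions over the text set.)"""
    logp = np.asarray(logp, dtype=float)
    logq = np.asarray(logq, dtype=float)
    # Softmax-normalize each log-likelihood vector (max-shift for stability).
    p = np.exp(logp - logp.max())
    p /= p.sum()
    q = np.exp(logq - logq.max())
    q /= q.sum()
    # KL(P || Q) = sum_i p_i * (log p_i - log q_i), nonnegative by Gibbs' inequality.
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Identical log-likelihood vectors give zero divergence:
print(kl_from_loglik([-1.0, -2.0, -3.0], [-1.0, -2.0, -3.0]))  # → 0.0
```

Because every model is mapped into the same vector space, the same function applies unchanged to any pair of models, checkpoints, or layers evaluated on the shared text set.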
