
Early-stopping for Transformer model training

Main: 19 pages · 15 figures · 5 tables · Bibliography: 1 page · Appendix: 17 pages
Abstract

This work introduces a novel early-stopping strategy for Transformer training, grounded in Random Matrix Theory (RMT). Using the Power Law (PL) fit to Transformer attention matrices as a probe, we demarcate training into three stages: structural exploration, heavy-tailed structure stabilization, and convergence saturation. Empirically, we observe that the spectral density of the shallow self-attention matrix VV consistently evolves into a heavy-tailed distribution. Crucially, we propose two consistent, validation-set-free criteria: a quantitative metric of heavy-tailed dynamics and a novel spectral signature indicative of convergence. The strong agreement between these criteria highlights the utility of RMT for monitoring and diagnosing the progression of Transformer model training.
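
The following is a minimal sketch (not the authors' code) of the general idea behind such a probe: compute the eigenvalue spectrum of an attention weight matrix at each checkpoint and fit a power-law tail exponent, tracking its trajectory over training. The tail cutoff, layer choice, and plateau interpretation below are illustrative assumptions, not the paper's exact criteria.

```python
import numpy as np

def esd(weight: np.ndarray) -> np.ndarray:
    """Eigenvalues of the correlation matrix W^T W (support of the empirical spectral density)."""
    sv = np.linalg.svd(weight, compute_uv=False)
    return np.sort(sv ** 2)

def power_law_alpha(eigs: np.ndarray, xmin: float | None = None) -> float:
    """Continuous power-law MLE (Clauset et al.): alpha = 1 + n / sum(log(x / xmin))."""
    if xmin is None:
        xmin = np.median(eigs)  # illustrative choice of tail cutoff
    tail = eigs[eigs >= xmin]
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

# Example: track alpha of one attention projection matrix across checkpoints.
# Synthetic matrices stand in for saved checkpoints of a shallow attention layer.
rng = np.random.default_rng(0)
checkpoints = [rng.standard_normal((512, 512)) * s for s in (1.0, 0.5, 0.3)]
alphas = [power_law_alpha(esd(W)) for W in checkpoints]
print(alphas)  # a stabilizing alpha would suggest the heavy-tailed regime has set in
```

In this reading, a falling-then-flattening tail exponent would separate the exploration stage from the stabilization and saturation stages, which is the kind of validation-set-free signal the abstract describes.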
