Early-stopping for Transformer model training

This work introduces a novel early-stopping strategy for Transformer training, grounded in Random Matrix Theory (RMT). Using the Power Law (PL) fit to the spectra of Transformer attention matrices as a probe, we demarcate training into three stages: structural exploration, heavy-tailed structure stabilization, and convergence saturation. Empirically, we observe that the spectral density of the shallow self-attention matrices consistently evolves into a heavy-tailed distribution. Crucially, we propose two consistent, validation-set-free criteria: a quantitative metric for heavy-tailed dynamics and a novel spectral signature indicative of convergence. The strong agreement between these criteria highlights the utility of RMT for monitoring and diagnosing the progression of Transformer model training.
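To make the probe concrete, the sketch below shows one plausible way to compute the empirical spectral density of an attention weight matrix and fit a power-law tail exponent. It is a minimal illustration, not the paper's implementation: the tail cutoff heuristic, the `esd`/`power_law_alpha` helper names, and the `model.layers[0].self_attn.q_proj.weight` attribute path are all assumptions for the example.

```python
import numpy as np
import torch


def esd(weight: torch.Tensor) -> np.ndarray:
    """Empirical spectral density: eigenvalues of W^T W (squared singular values)."""
    w = weight.detach().float().cpu().numpy()
    singular_values = np.linalg.svd(w, compute_uv=False)
    return singular_values ** 2


def power_law_alpha(eigs: np.ndarray, x_min=None) -> float:
    """Maximum-likelihood estimate of the power-law exponent for the spectral tail
    (continuous-data estimator in the style of Clauset et al. / the Hill estimator)."""
    eigs = np.sort(eigs)
    if x_min is None:
        # Crude tail cutoff for illustration: upper half of the spectrum.
        x_min = eigs[len(eigs) // 2]
    tail = eigs[eigs >= x_min]
    return 1.0 + len(tail) / np.sum(np.log(tail / x_min))


# Hypothetical usage: probe a shallow attention layer's query projection
# at each checkpoint and track the exponent over training steps.
#   alpha = power_law_alpha(esd(model.layers[0].self_attn.q_proj.weight))
# Under the paper's framing, an exponent that first drops (heavy-tailed
# structure forming) and then plateaus would signal convergence saturation,
# i.e. a candidate early-stopping point, without any validation set.
```

Tracking such an exponent per checkpoint is cheap relative to a validation pass, which is what makes a validation-set-free stopping criterion attractive in practice.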