
$\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization

Main: 13 pages
17 figures
Bibliography: 4 pages
Appendix: 33 pages
Abstract

While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open question whether there is a mathematical framework that characterizes what kind of features emerge, how, and under which conditions they emerge during training, for complex structured inputs. We propose a novel framework, named $\mathbf{Li_2}$, that captures three key stages of the grokking behavior of 2-layer nonlinear networks: (I) lazy learning, (II) independent feature learning, and (III) interactive feature learning, characterized by the structure of the backpropagated gradient $G_F$ across layers. In (I), $G_F$ is random, and the top layer overfits to the random hidden representation. In (II), the gradient of each hidden node (a column of $G_F$) depends only on that node's own activation, so each hidden node learns its representation independently from $G_F$, which now carries information about the target labels, thanks to weight decay. Interestingly, the independent dynamics follows exactly the gradient ascent of an energy function $E$, and its local maxima are precisely the emerging features. We study whether these local-optima-induced features are generalizable, their representation power, and how they change with sample size, in group arithmetic tasks. Finally, in (III), we provably show how hidden nodes interact, and how $G_F$ changes to focus on missing features that still need to be learned. Our study sheds light on the roles played by key hyperparameters such as weight decay, learning rate, and sample size in grokking, leads to provable scaling laws of memorization and generalization, and reveals, from the first principles of gradient dynamics, the underlying cause why recent optimizers such as Muon can be effective. Our analysis can be extended to multi-layer architectures.
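To make the setting concrete, below is a minimal sketch of the kind of experiment the abstract refers to: a 2-layer nonlinear network trained with weight decay on a group arithmetic task (modular addition), where delayed generalization appears as training accuracy saturating long before test accuracy rises. The architecture, hyperparameters, and train/test split here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative setup (assumed, not the paper's code): 2-layer nonlinear network
# on the group arithmetic task (a + b) mod p, trained with weight decay.
p = 97            # modulus of the group arithmetic task
hidden = 256      # number of hidden nodes

# Full dataset of (a, b) -> (a + b) mod p, split into train/test.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
n_train = int(0.4 * len(pairs))   # sample size controls memorization vs. generalization
train_idx, test_idx = perm[:n_train], perm[n_train:]

def encode(ab):
    # Concatenated one-hot encodings of the two group elements.
    return torch.cat([F.one_hot(ab[:, 0], p), F.one_hot(ab[:, 1], p)], dim=1).float()

x_train, y_train = encode(pairs[train_idx]), labels[train_idx]
x_test, y_test = encode(pairs[test_idx]), labels[test_idx]

model = nn.Sequential(            # 2-layer nonlinear network: lower layer + top layer
    nn.Linear(2 * p, hidden),
    nn.ReLU(),
    nn.Linear(hidden, p),
)

# Weight decay is the key hyperparameter emphasized in the abstract; it lets the
# backpropagated gradient carry label information down to the hidden layer.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(20000):
    opt.zero_grad()
    loss = F.cross_entropy(model(x_train), y_train)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            train_acc = (model(x_train).argmax(-1) == y_train).float().mean()
            test_acc = (model(x_test).argmax(-1) == y_test).float().mean()
        # Grokking shows up as train_acc saturating long before test_acc rises.
        print(f"step {step:6d}  loss {loss.item():.3f}  "
              f"train {train_acc:.3f}  test {test_acc:.3f}")
```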
