
Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity

Abstract

The dynamics of Deep Linear Networks (DLNs) is dramatically affected by the variance $\sigma^2$ of the parameters at initialization $\theta_0$. For DLNs of width $w$, we show a phase transition w.r.t. the scaling $\gamma$ of the variance $\sigma^2 = w^{-\gamma}$ as $w \to \infty$: for large variance ($\gamma < 1$), $\theta_0$ is very close to a global minimum but far from any saddle point, and for small variance ($\gamma > 1$), $\theta_0$ is close to a saddle point and far from any global minimum. While the first case corresponds to the well-studied NTK regime, the second case is less understood. This motivates the study of the case $\gamma \to +\infty$, where we conjecture a Saddle-to-Saddle dynamics: throughout training, gradient descent visits the neighborhoods of a sequence of saddles, each corresponding to linear maps of increasing rank, until reaching a sparse global minimum. We support this conjecture with a theorem for the dynamics between the first two saddles, as well as some numerical experiments.
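A minimal numerical sketch of the conjectured Saddle-to-Saddle dynamics is given below, assuming a depth-2 linear network trained by full-batch gradient descent on a rank-3 target. All hyperparameters (width $w = 100$, scaling $\gamma = 4$, learning rate, step count, rank threshold) are illustrative choices, not values from the paper. With $\sigma^2 = w^{-\gamma}$, the effective rank of the end-to-end map $W_2 W_1$ should increase one unit at a time, with loss plateaus near the intermediate saddles.

```python
import numpy as np

# Hedged sketch of the conjectured Saddle-to-Saddle dynamics: a depth-2
# linear network W2 @ W1 trained by full-batch gradient descent on a
# rank-3 target A, with tiny initialization sigma^2 = w^{-gamma}.
# All hyperparameters below (w, gamma, lr, step count, rank threshold)
# are illustrative choices, not values taken from the paper.

rng = np.random.default_rng(0)
d, w, gamma = 10, 100, 4.0           # in/out dimension, width, variance scaling
sigma = w ** (-gamma / 2)            # so that sigma^2 = w^{-gamma}

# Rank-3 target with well-separated singular values 4 > 2 > 1.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = U[:, :3] @ np.diag([4.0, 2.0, 1.0]) @ V[:, :3].T

W1 = sigma * rng.standard_normal((w, d))
W2 = sigma * rng.standard_normal((d, w))

lr = 0.02
for step in range(1001):
    E = W2 @ W1 - A                  # residual of the end-to-end linear map
    g2 = E @ W1.T                    # dL/dW2 for L = 0.5 * ||W2 W1 - A||_F^2
    g1 = W2.T @ E                    # dL/dW1
    W2 -= lr * g2
    W1 -= lr * g1
    if step % 50 == 0:
        s = np.linalg.svd(W2 @ W1, compute_uv=False)
        rank = int((s > 0.1).sum())  # effective rank of the learned map
        print(f"step {step:5d}  loss {0.5 * (E**2).sum():8.4f}  rank {rank}")
```

On a run like this, the printed effective rank typically climbs 0 → 1 → 2 → 3, with the loss plateauing near each saddle before the next singular direction escapes; the exact plateau lengths depend on the illustrative hyperparameters above.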
