
Intrinsic training dynamics of deep neural networks

Main: 10 pages · Bibliography: 2 pages · Appendix: 22 pages · 1 table
Abstract

A fundamental challenge in the theory of deep learning is to understand whether gradient-based training can promote parameters belonging to certain lower-dimensional structures (e.g., sparse or low-rank sets), leading to so-called implicit bias. As a stepping stone, motivated by the proof structure of existing implicit bias analyses, we study when a gradient flow on a parameter $\theta$ implies an intrinsic gradient flow on a ``lifted'' variable $z = \phi(\theta)$, for an architecture-related function $\phi$. We formulate a so-called intrinsic dynamic property and show how it relates to the study of conservation laws associated with the factorization $\phi$. This leads to a simple criterion based on the inclusion of kernels of linear maps, which yields a necessary condition for the property to hold. We then apply our theory to general ReLU networks of arbitrary depth and show that, for a dense set of initializations, the flow can be rewritten as an intrinsic dynamic in a lower dimension that depends only on $z$ and the initialization, when $\phi$ is the so-called path-lifting. In the case of linear networks, with $\phi$ the product of weight matrices, the intrinsic dynamic is known to hold under so-called balanced initializations; we generalize this to a broader class of {\em relaxed balanced} initializations and show that, in certain configurations, these are the \emph{only} initializations that ensure the intrinsic dynamic property. Finally, for the linear neural ODE associated with the limit of infinitely deep linear networks, with relaxed balanced initialization, we make the corresponding intrinsic dynamics explicit.
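To make the intrinsic dynamic property concrete, the following minimal LaTeX sketch spells out the chain-rule computation implicit in the abstract. The loss $L$ on the lifted variable and the matrix $M$ are illustrative notation (not necessarily the paper's), and the balanced condition quoted for linear networks is the classical one from the prior literature that the abstract refers to.

% Minimal sketch of the "intrinsic dynamic" setup; L and M are illustrative names.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Gradient flow on the parameters $\theta$ of the lifted loss $L \circ \phi$:
\begin{equation}
  \dot\theta(t) = -\nabla_\theta (L \circ \phi)(\theta(t))
               = -D\phi(\theta(t))^\top \nabla L(z(t)),
  \qquad z(t) := \phi(\theta(t)).
\end{equation}
By the chain rule, the lifted variable then evolves as
\begin{equation}
  \dot z(t) = D\phi(\theta(t))\,\dot\theta(t)
            = -\,D\phi(\theta(t))\,D\phi(\theta(t))^\top\,\nabla L(z(t)).
\end{equation}
The dynamic is \emph{intrinsic} when $D\phi(\theta)\,D\phi(\theta)^\top$ can be
expressed as a function $M$ of $z$ (and the initialization) alone, so that the
flow closes in $z$:
\begin{equation}
  \dot z(t) = -\,M(z(t))\,\nabla L(z(t)).
\end{equation}
For linear networks, $\phi(\theta) = W_N \cdots W_1$, and the classical
balanced initialization condition under which such a closure is known to hold is
\begin{equation}
  W_{j+1}^\top W_{j+1} = W_j\,W_j^\top, \qquad j = 1, \dots, N-1.
\end{equation}
\end{document}

The abstract's contribution, in this notation, is to characterize when such a closure exists at all (via kernels of linear maps and conservation laws), and to enlarge the balanced condition to relaxed balanced initializations.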
