On the Implicit Bias Towards Minimal Depth of Deep Neural Networks

We study the implicit bias of stochastic gradient descent toward low-depth solutions when training deep neural networks. Recent results in the literature suggest that the penultimate-layer representations learned by a classifier over multiple classes exhibit a clustering property called neural collapse. First, we empirically show that neural collapse generally strengthens as the number of layers increases. In addition, we demonstrate that neural collapse extends beyond the penultimate layer and emerges in intermediate layers as well, rendering the higher layers essentially redundant. We formalize the notion of effective depth, which measures the minimal layer that exhibits neural collapse. Building on this, we hypothesize and empirically show that gradient descent implicitly selects neural networks of small effective depth. Finally, we empirically show that the effective depth of a trained neural network generally increases when training with larger portions of random labels, and we theoretically connect effective depth with generalization.
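To make the notion of effective depth concrete, the following is a minimal sketch of how one could measure per-layer neural collapse and read off an effective depth from it. It uses a simple within-class-to-between-class variability ratio as a proxy for the usual NC1 criterion, and the `threshold` value and function names are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def within_between_ratio(features, labels):
    """Per-layer collapse proxy: ratio of within-class to between-class
    variability of the feature vectors (lower = stronger collapse).
    `features` is an (n_samples, dim) array of one layer's representations."""
    global_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        class_feats = features[labels == c]
        class_mean = class_feats.mean(axis=0)
        within += ((class_feats - class_mean) ** 2).sum()
        between += len(class_feats) * ((class_mean - global_mean) ** 2).sum()
    return within / between

def effective_depth(layer_features, labels, threshold=0.1):
    """Smallest layer index whose collapse metric falls below `threshold`
    (the threshold is an illustrative choice, not taken from the paper)."""
    for depth, feats in enumerate(layer_features, start=1):
        if within_between_ratio(feats, labels) < threshold:
            return depth
    return len(layer_features)  # no layer collapsed: full depth
```

In practice, `layer_features` would be a list of per-layer activation matrices collected on a held-out batch (e.g., via forward hooks), so the same trained network yields one collapse score per layer and a single effective-depth estimate.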