On the Maximum Hessian Eigenvalue and Generalization

Abstract

The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remain a mystery. Prior works have speculated that "flatter" solutions generalize better to unseen data than "sharper" solutions, motivating several metrics for measuring flatness (particularly $\lambda_{max}$, the largest eigenvalue of the Hessian of the loss) and algorithms, such as Sharpness-Aware Minimization (SAM) [1], that directly optimize for flatness. Other works question the link between $\lambda_{max}$ and generalization. In this paper, we present findings that call $\lambda_{max}$'s influence on generalization further into question. We show that: (1) while larger learning rates reduce $\lambda_{max}$ for all batch sizes, the generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change $\lambda_{max}$ without affecting generalization; (3) while SAM produces smaller $\lambda_{max}$ for all batch sizes, its generalization benefits likewise vanish at larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization even as they promote smaller $\lambda_{max}$; and (5) while batch normalization does not consistently produce smaller $\lambda_{max}$, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the GD-SGD discrepancy demonstrates limits to $\lambda_{max}$'s ability to explain generalization in neural networks.
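The quantity $\lambda_{max}$ is typically estimated without forming the Hessian explicitly, using power iteration with Hessian-vector products. The snippet below is a minimal sketch of that standard procedure, assuming PyTorch; the function name `lambda_max` and its arguments are illustrative and not taken from the paper or its code release.

```python
# Hypothetical sketch: estimate lambda_max, the largest eigenvalue of the Hessian of
# the loss, via power iteration with Hessian-vector products (no explicit Hessian).
import torch

def lambda_max(loss_fn, params, n_iters=20):
    """Estimate the top Hessian eigenvalue of loss_fn with respect to params."""
    # Random starting direction with the same shapes as the parameters.
    v = [torch.randn_like(p) for p in params]
    eigenvalue = None
    for _ in range(n_iters):
        # Normalize the current direction.
        norm = torch.sqrt(sum((u * u).sum() for u in v))
        v = [u / norm for u in v]
        # Gradient of the loss, with the graph retained for a second backward pass.
        loss = loss_fn()
        grads = torch.autograd.grad(loss, params, create_graph=True)
        # Hessian-vector product: differentiate (grad . v) with respect to params.
        g_dot_v = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(g_dot_v, params)
        # Rayleigh quotient gives the current eigenvalue estimate.
        eigenvalue = sum((h * u).sum() for h, u in zip(hv, v)).item()
        # Use the Hessian-vector product as the next direction.
        v = [h.detach() for h in hv]
    return eigenvalue
```

In practice, the eigenvalue estimate is monitored across iterations and the loop is stopped once it stabilizes; this is one common way the flatness measurements discussed above could be computed.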
