
Stable Weight Decay Regularization

Neural Information Processing Systems (NeurIPS), 2020
Zeke Xie
Issei Sato
Abstract

Weight decay is a popular regularization technique for training deep neural networks. Modern deep learning libraries mainly use L2 regularization as the default implementation of weight decay. Loshchilov and Hutter (2019) demonstrated that L2 regularization is not identical to weight decay for adaptive gradient methods, such as Adaptive Moment Estimation (Adam), and proposed Adam with Decoupled Weight Decay (AdamW). However, we found that the popular implementations of weight decay in modern deep learning libraries, including L2 regularization and decoupled weight decay, often hurt performance. First, L2 regularization is an unstable form of weight decay for all optimizers that use momentum, such as stochastic gradient descent (SGD) with momentum. Second, decoupled weight decay is highly unstable for all adaptive gradient methods. We further propose the Stable Weight Decay (SWD) method to fix the unstable weight decay problem from a dynamical perspective. In our experiments, SWD yields significant improvements over L2 regularization and decoupled weight decay. Simply fixing weight decay in Adam via SWD, with no extra hyperparameter, can often outperform complex Adam variants that have more hyperparameters.
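To make the contrast in the abstract concrete, below is a minimal sketch (not the authors' code; SWD itself is not specified here) of the two weight decay implementations being compared for a single Adam-style step: L2 regularization folds the decay term into the gradient before adaptive preconditioning, while decoupled weight decay (AdamW) applies the decay directly to the weights. Function names and hyperparameter defaults are illustrative assumptions.

```python
import numpy as np

def adam_step_l2(w, grad, m, v, t, lr=1e-3, wd=1e-2,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """L2 regularization: the decay term wd * w is added to the gradient,
    so it gets rescaled by Adam's adaptive preconditioner."""
    g = grad + wd * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)          # bias-corrected first moment
    v_hat = v / (1 - beta2**t)          # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adam_step_decoupled(w, grad, m, v, t, lr=1e-3, wd=1e-2,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    """Decoupled weight decay (AdamW): the decay is applied directly to the
    weights, outside the adaptive gradient update."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v

# Toy usage: one update step on a small parameter vector (t starts at 1).
w = np.ones(3)
m = np.zeros(3)
v = np.zeros(3)
g = np.array([0.1, -0.2, 0.3])
w_l2, _, _ = adam_step_l2(w, g, m, v, t=1)
w_dec, _, _ = adam_step_decoupled(w, g, m, v, t=1)
```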
