Stochastic Weight Averaging Revisited

Averaging neural network weights sampled by a backbone stochastic gradient descent (SGD) procedure is a simple yet effective way to help the backbone SGD find optima that generalize better. From a statistical perspective, weight averaging (WA) contributes to variance reduction. Recently, a well-established method termed stochastic weight averaging (SWA) was proposed, characterized by the application of a cyclical or high constant (CHC) learning rate schedule (LRS) when generating the weight samples for the WA operation. This gave rise to a new insight on WA, namely that WA helps to discover wider optima, which in turn lead to better generalization. We conduct extensive experimental studies of SWA, involving a dozen modern DNN architectures and a dozen benchmark open-source image, graph, and text datasets. We disentangle the contributions of the WA operation and the CHC LRS to SWA, showing that the WA operation in SWA still contributes to variance reduction but does not always lead to wide optima. We show how the statistical and geometric views of SWA can be reconciled. Based on our experimental findings, we put forward the hypothesis that the DNN loss landscape contains global-scale geometric structures that an SGD agent can discover at an early stage of its run, and that the WA operation can exploit these structures. This hypothesis inspires an algorithm design termed periodic SWA (PSWA). We find that PSWA markedly outperforms its backbone SGD during the early stage of the SGD sampling process, which demonstrates that our hypothesis holds. Code for reproducing the experimental results is available at https://github.com/ZJLAB-AMMI/PSWA.
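To make the WA operation concrete, the following is a minimal PyTorch sketch of collecting weight samples from an SGD run under a cyclical learning rate and folding them into a running average; the model, synthetic data, cycle length, and learning rates are illustrative placeholders and not the paper's experimental configuration.

```python
# Minimal sketch of the WA operation with a cyclical learning rate schedule.
# All hyperparameters and the toy model/data below are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                      # placeholder DNN
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

def cyclical_lr(step, cycle_len=50, lr_max=0.05, lr_min=0.005):
    # Cyclical LRS: decay within each cycle, then reset to the high value.
    t = (step % cycle_len) / cycle_len
    return (1 - t) * lr_max + t * lr_min

swa_state = None   # running average of the sampled weights
n_avg = 0

for step in range(500):
    x = torch.randn(32, 10)                   # synthetic batch (placeholder)
    y = torch.randint(0, 2, (32,))
    for g in opt.param_groups:
        g["lr"] = cyclical_lr(step)
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

    # Take a weight sample at the end of each cycle and update the running
    # average: w_swa <- (n * w_swa + w) / (n + 1).
    if (step + 1) % 50 == 0:
        sample = {k: v.detach().clone() for k, v in model.state_dict().items()}
        if swa_state is None:
            swa_state = sample
        else:
            for k in swa_state:
                swa_state[k].mul_(n_avg / (n_avg + 1)).add_(sample[k] / (n_avg + 1))
        n_avg += 1

# Load the averaged weights for evaluation; for models with BatchNorm layers,
# the BN statistics would additionally need to be recomputed on training data.
model.load_state_dict(swa_state)
```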