Understanding and Improving Shampoo and SOAP via Kullback-Leibler Minimization

Main: 18 pages; Bibliography: 5 pages; Appendix: 1 page; 12 figures; 2 tables

Abstract

Shampoo and its efficient variant, SOAP, employ structured second-moment estimation and have shown strong performance for training neural networks (NNs). In practice, however, Shampoo typically requires step-size grafting with Adam to be competitive, and SOAP mitigates this by applying Adam in Shampoo's eigenbasis; in both methods, Adam adds memory overhead. Prior analyses have largely relied on the Frobenius norm to motivate these estimation schemes. We instead recast their estimation procedures as covariance estimation under Kullback-Leibler (KL) divergence minimization, revealing a previously overlooked theoretical limitation and motivating principled redesigns. Building on this perspective, we develop KL-Shampoo and KL-SOAP, practical schemes that match or exceed the performance of Shampoo and SOAP in NN pre-training while achieving SOAP-level per-iteration runtime. Notably, KL-Shampoo does not rely on Adam to attain competitive performance, eliminating the memory overhead introduced by Adam. Across our experiments, KL-Shampoo consistently outperforms SOAP, Shampoo, and even KL-SOAP, establishing the KL-based approach as a compelling foundation for designing structured methods in NN optimization.
