FlashOptim: Optimizers for Memory Efficient Training

Jose Javier Gonzalez Ortiz
Abhay Gupta
Chris Renard
Davis Blalock
Main: 8 pages · Appendix: 3 pages · Bibliography: 4 pages · 9 figures · 9 tables
Abstract

Standard mixed-precision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just the parameter itself, but also its gradient and one or more optimizer state variables. With each of these values typically requiring 4 bytes, training even a 7-billion-parameter model can be impractical for researchers with less than 100 GB of accelerator memory.
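The arithmetic behind that claim can be sketched as follows. This is a back-of-envelope sketch, not the paper's accounting: it assumes an Adam-style optimizer with two state tensors and 4 bytes per value, and it ignores activations, buffers, and framework overhead.

```python
# Back-of-envelope memory accounting for mixed-precision training.
# Assumption: 4 bytes each for the parameter, its gradient, and each
# optimizer state tensor (two for Adam-style optimizers).

def training_bytes_per_param(n_optimizer_states: int = 2) -> int:
    """Bytes per parameter: parameter + gradient + optimizer states."""
    bytes_per_value = 4
    return bytes_per_value * (2 + n_optimizer_states)

def total_training_gb(n_params: float, n_optimizer_states: int = 2) -> float:
    """Total training memory in GB for a model with n_params parameters."""
    return n_params * training_bytes_per_param(n_optimizer_states) / 1e9

if __name__ == "__main__":
    # A 7-billion-parameter model with two optimizer state tensors:
    print(f"{total_training_gb(7e9):.0f} GB")  # 112 GB, above a 100 GB budget
```

At 16 bytes per parameter, 7 billion parameters already exceed a 100 GB accelerator before any activation memory is counted.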
