FlashOptim: Optimizers for Memory Efficient Training
Jose Javier Gonzalez Ortiz
Abhay Gupta
Chris Renard
Davis Blalock
Abstract
Standard mixed-precision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just the parameter itself, but also its gradient and one or more optimizer state variables. With each of these values typically requiring 4 bytes, training even a 7-billion-parameter model can be impractical for researchers with less than 100GB of accelerator memory.
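As a back-of-the-envelope sketch of this arithmetic, the snippet below estimates the memory footprint assuming an Adam-style optimizer with two state values per parameter, all stored in 4 bytes each; the helper function and its defaults are illustrative, not from the paper:

```python
def training_memory_gb(num_params: int,
                       bytes_per_value: int = 4,
                       optimizer_states: int = 2) -> float:
    """Estimate accelerator memory for parameters, gradients, and
    optimizer state, ignoring activations and framework buffers."""
    values_per_param = 1 + 1 + optimizer_states  # param + grad + states
    return num_params * values_per_param * bytes_per_value / 1e9

# A 7B-parameter model: 7e9 params * 4 values * 4 bytes ≈ 112 GB,
# which already exceeds the 100GB figure cited above.
print(f"{training_memory_gb(7_000_000_000):.0f} GB")
```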
