FlashOptim: Optimizers for Memory Efficient Training
Jose Javier Gonzalez Ortiz
Abhay Gupta
Chris Renard
Davis Blalock
Abstract
Standard mixed-precision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just the parameter itself, but also its gradient and one or more optimizer state variables. With each of these values typically requiring 4 bytes, training even a 7-billion-parameter model can be impractical for researchers with less than 100GB of accelerator memory.
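As a back-of-the-envelope sketch of this arithmetic, the snippet below estimates the memory footprint assuming an Adam-style optimizer with two state values per parameter, all stored in 4 bytes each; the helper function and its defaults are illustrative, not from the paper:

```python
def training_memory_gb(num_params: int,
                       bytes_per_value: int = 4,
                       optimizer_states: int = 2) -> float:
    """Estimate accelerator memory for parameters, gradients, and
    optimizer state, ignoring activations and framework buffers."""
    values_per_param = 1 + 1 + optimizer_states  # param + grad + states
    return num_params * values_per_param * bytes_per_value / 1e9

# A 7B-parameter model: 7e9 params * 4 values * 4 bytes ≈ 112 GB,
# which already exceeds the 100GB figure cited above.
print(f"{training_memory_gb(7_000_000_000):.0f} GB")
```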
