Recipes for Pre-training LLMs with MXFP8

Abstract

Precision scaling - using fewer bits to represent model parameters and related tensors during pre-training - has emerged as a compelling technique for improving GPU efficiency without sacrificing accuracy. Microscaling (MX) formats, supported natively by NVIDIA's latest Blackwell GPUs, are a major step forward for precision scaling: they combine narrow floating-point data types with per-block scaling factors, enabling fine-grained quantization of tensors.
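To make the block-scaling idea concrete, below is a minimal NumPy sketch of MX-style quantization. It is an illustration under stated assumptions, not the paper's recipe or NVIDIA's hardware implementation: it assumes 32-element blocks, FP8 E4M3 elements (maximum magnitude 448), and a shared power-of-two scale per block; the helper names mx_quantize and mx_dequantize are hypothetical.

```python
import numpy as np

BLOCK_SIZE = 32      # elements sharing one scale (assumed block size)
E4M3_MAX = 448.0     # largest magnitude representable in FP8 E4M3

def mx_quantize(x):
    """Split x into blocks and pick a shared power-of-two scale per block."""
    blocks = x.reshape(-1, BLOCK_SIZE).astype(np.float32)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    amax = np.maximum(amax, np.finfo(np.float32).tiny)  # avoid log2(0)
    # Round the scale exponent up so every scaled element fits in [-448, 448].
    scales = 2.0 ** np.ceil(np.log2(amax / E4M3_MAX))
    # A real kernel would cast to FP8 E4M3 here; clipping stands in for that.
    elements = np.clip(blocks / scales, -E4M3_MAX, E4M3_MAX)
    return scales, elements

def mx_dequantize(scales, elements):
    """Reconstruct an approximation of the original tensor."""
    return (elements * scales).reshape(-1)

if __name__ == "__main__":
    x = np.random.randn(4 * BLOCK_SIZE).astype(np.float32)
    x_hat = mx_dequantize(*mx_quantize(x))
    print("max abs reconstruction error:", np.abs(x_hat - x).max())
```

Restricting each block's scale to a power of two mirrors the MX convention of storing scales as 8-bit exponents (E8M0), so applying a scale costs only an exponent adjustment rather than a full multiply.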

7 pages main text, 4 pages bibliography, 2 pages appendix; 7 figures, 3 tables.