DeMo: Decoupled Momentum Optimization

29 November 2024

Bowen Peng

Jeffrey Quesnelle

Diederik P. Kingma

Qiang Liu

Main: 9 pages; Bibliography: 4 pages; Appendix: 13 pages; 8 figures; 3 tables

Abstract

Scaling neural network training increasingly depends on synchronous data-parallelism, yet full-precision gradient all-reduce imposes a severe communication bottleneck. We propose Decoupled Momentum Optimization (DeMo), a drop-in replacement for any momentum-based optimizer that significantly reduces communication bandwidth while maintaining convergence. DeMo (i) decouples local momentum updates, (ii) applies a fast orthonormal transform (e.g., DCT) followed by top-k sparsification, and (iii) reuses the momentum buffer as error feedback via momentum subtraction. This design reduces per-step communication by up to two orders of magnitude with minimal computational overhead. Experiments on 300M- and 1B-parameter language models show that DeMo transmits up to 85x less data per GPU than AdamW-DDP while achieving comparable loss and accuracy. DeMo is topology-agnostic and enables training across multi-datacenter or Ethernet-based setups. Code is available at this https URL.
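
The three steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single 1-D parameter vector, uses a dense orthonormal DCT-II matrix where a real system would use a fast transform, and the names `dct_matrix`, `demo_local_step`, and `aggregate` are hypothetical. Chunking of large tensors and the final parameter update rule are omitted.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so that C @ C.T is the identity."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def demo_local_step(momentum, grad, beta, k, C):
    """One worker-local step (illustrative only).

    Returns the residual momentum kept locally and the sparse
    transform coefficients that would be sent to other workers.
    """
    # (i) decoupled momentum: accumulate locally, no gradient all-reduce
    m = beta * momentum + grad
    # (ii) orthonormal transform followed by top-k sparsification
    coeffs = C @ m
    top = np.argpartition(np.abs(coeffs), -k)[-k:]
    sparse = np.zeros_like(coeffs)
    sparse[top] = coeffs[top]
    # (iii) momentum subtraction: remove what was transmitted, so the
    # untransmitted remainder stays in the buffer as error feedback
    m = m - C.T @ sparse
    return m, sparse

def aggregate(sparse_list, C):
    """Stand-in for the collective: average sparse coefficients, then decode."""
    return C.T @ np.mean(sparse_list, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k, beta, workers = 64, 4, 0.9, 2
    C = dct_matrix(n)
    momenta = [np.zeros(n) for _ in range(workers)]
    grads = [rng.normal(size=n) for _ in range(workers)]
    results = [demo_local_step(m, g, beta, k, C) for m, g in zip(momenta, grads)]
    update = aggregate([s for _, s in results], C)
    print("values sent per worker:", k, "of", n)
    print("decoded update shape:", update.shape)
```

In this sketch each worker transmits only `k` coefficients (plus their indices) instead of `n` dense values, and because the transform is orthonormal, the transmitted part and the retained residual always sum back to the full local momentum.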
