LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

Andrej Jovanović
Alex Iacob
Mher Safaryan
Ionut-Vlad Modoranu
Lorenzo Sani
William F. Shen
Xinchi Qiu
Dan Alistarh
Nicholas D. Lane
Main: 8 pages · 13 figures · 4 tables · Bibliography: 3 pages · Appendix: 18 pages
Abstract

Distributed training of foundation models via DDP is limited by interconnect bandwidth. While infrequent-communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose LoRDO, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. LoRDO achieves near-parity with low-rank DDP on language modeling and downstream tasks at model scales of 125M–720M, while reducing communication by ≈10×. Finally, we show that LoRDO yields even larger gains in very low-memory settings with small rank and batch size.
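The mechanism the abstract describes can be sketched at a high level: at each synchronization round, workers' drift from the global model forms a pseudo-gradient, a low-rank basis is derived from it, momentum is tracked in that subspace, and the final update quasi-hyperbolically mixes the low-rank momentum direction with the full-rank pseudo-gradient to restore subspace exploration. The sketch below is a minimal illustration of that idea; the function name, shapes, hyperparameters, and the exact mixing rule are assumptions, not the paper's algorithm.

```python
import numpy as np

def lordo_outer_step(theta, local_thetas, m, rank, lr=1.0, beta=0.9, nu=0.7):
    """One illustrative outer (synchronization) step of a LoRDO-style update.

    theta        : global parameter matrix
    local_thetas : stack of workers' parameters after their local steps
    m            : low-rank momentum, shape (rank, theta.shape[1])
    Returns (new_theta, m, P). Hypothetical simplification: the momentum is
    reused across rounds even though the projection basis P is recomputed.
    """
    # Pseudo-gradient: average drift of workers from the global model.
    pseudo_grad = theta - np.mean(local_thetas, axis=0)
    # Global low-rank projection computed from the pseudo-gradient (top-r SVD basis).
    U, _, _ = np.linalg.svd(pseudo_grad, full_matrices=False)
    P = U[:, :rank]
    # Momentum tracked in the rank-r subspace (this is where memory is saved).
    m = beta * m + (1 - beta) * (P.T @ pseudo_grad)
    # Quasi-hyperbolic mix: the full-rank pseudo-gradient term lets the
    # trajectory leave the momentum's low-rank subspace.
    update = nu * (P @ m) + (1 - nu) * pseudo_grad
    return theta - lr * update, m, P
```

With `nu = 0` the step reduces to a plain full-rank pseudo-gradient update, and with `nu = 1` it is confined to the rank-`r` subspace, which is the restriction the full-rank mixing term is meant to escape.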
