Dion: A Communication-Efficient Optimizer for Large Models

Abstract
Training large AI models efficiently requires distributing computation across multiple accelerators, but this often incurs significant communication overhead -- especially during gradient synchronization. We introduce Dion, a communication-efficient optimizer that retains the synchronous semantics of standard distributed training (e.g., DDP, FSDP) while substantially reducing I/O costs. Unlike conventional optimizers that synchronize full gradient matrices, Dion leverages orthonormalized updates with device-local momentum buffers, eliminating the need for full gradient exchange. It further supports an efficient sharding strategy that avoids reconstructing large matrices during training.
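To make the idea of orthonormalized updates with device-local momentum concrete, here is a minimal single-device sketch in PyTorch. It is illustrative only and not the paper's algorithm: the rank parameter, the power-iteration/QR orthonormalization, and the function names are assumptions, and the actual Dion optimizer additionally shards state and avoids materializing full matrices across devices.

import torch

def orthonormalized_update(momentum: torch.Tensor, rank: int = 32) -> torch.Tensor:
    """Illustrative low-rank orthonormalized update for a 2D momentum matrix.

    Sketch only (not the paper's exact method): one randomized power-iteration
    step plus QR/SVD to produce an update with orthonormal structure.
    """
    m, n = momentum.shape
    r = min(rank, m, n)
    # Random sketch of the row space, then orthonormalize the column space.
    probe = torch.randn(n, r, device=momentum.device, dtype=momentum.dtype)
    q, _ = torch.linalg.qr(momentum @ probe)   # (m, r) orthonormal column basis
    proj = q.T @ momentum                      # (r, n) projected momentum
    # Orthonormalize the projection so the update has unit spectral scale.
    u, _, vh = torch.linalg.svd(proj, full_matrices=False)
    return q @ (u @ vh)                        # (m, n) orthonormalized update

@torch.no_grad()
def dion_like_step(param, grad, momentum, lr=0.01, beta=0.95, rank=32):
    """One hedged optimizer step: decay the device-local momentum buffer,
    accumulate the gradient, and apply an orthonormalized low-rank update."""
    momentum.mul_(beta).add_(grad)
    param.add_(orthonormalized_update(momentum, rank=rank), alpha=-lr)

In a distributed setting, the point of this style of update is that each device can keep its own momentum buffer and only the compact low-rank factors need to be communicated, rather than full gradient matrices; the sketch above omits that sharding logic entirely.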
@article{ahn2025_2504.05295,
  title   = {Dion: A Communication-Efficient Optimizer for Large Models},
  author  = {Kwangjun Ahn and Byron Xu},
  journal = {arXiv preprint arXiv:2504.05295},
  year    = {2025}
}