89

Enabling Reconfiguration-Communication Overlap for Collective Communication in Optical Networks

Main:17 Pages
13 Figures
Bibliography:2 Pages
1 Tables
Abstract

Collective communication (CC) is widely adopted for large-scale distributed machine learning (DML) training workloads. DML's predictable traffic pattern provides a great oppotunity for applying optical network technology. Existing optical interconnects-based CC schemes adopt ``one-shot network reconfiguration'', which provisions static high-capacity topologies for an entire collective operation -- sometimes for a full training iteration. However, this approach faces significant scalability limitations when supporting more complex and efficient CC algorithms required for modern workloads: the ``one-shot'' strategies either demand excessive resource overprovisioning or suffer performance degradation due to rigid resource allocation.

View on arXiv
Comments on this paper