Enabling Reconfiguration-Communication Overlap for Collective Communication in Optical Networks
Collective communication (CC) is critical for scaling distributed machine learning (DML). The predictable traffic patterns of DML present a great oppotunity for applying optical network technologies. Optical networks with reconfigurable topologies promise high bandwidth and low latency for collective communications. However, existing approaches face inherent limitations: static topologies are inefficient for dynamic communication patterns within CC algorithm, while frequent topology reconfiguration matching every step of the algorithm incurs significant overhead.In this paper, we propose SWOT, a demand-aware optical network framework that employs ``intra-collective reconfiguration'' to dynamically align network resources with CC traffic patterns. SWOT hides reconfiguration latency by overlapping it with data transmission through three key techniques: Heterogeneous Message Splitting, Asynchronous Overlapping, and Topology Bypassing. Extensive simulations demonstrate that SWOT reduces communication completion time up to 89.7% across diverse CC algorithm compared to static baselines, demonstrating strong robustness to varying optical resources and reconfiguration delay.
View on arXiv