ResearchTrend.AI
© 2026 ResearchTrend.AI, All rights reserved.

TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs

Symposium on Networked Systems Design and Implementation (NSDI), 2022
1 February 2022
Weiyang Wang, Moein Khazraee, Zhizhen Zhong, M. Ghobadi, Zhihao Jia, Dheevatsa Mudigere, Ying Zhang, A. Kewitsch

Papers citing "TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs"

27 citing papers:
Enabling Reconfiguration-Communication Overlap for Collective Communication in Optical Networks
Changbo Wu, Zhuolong Yu, Gongming Zhao, Hongli Xu
22 Oct 2025

Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective
Seokjin Go, Joongun Park, Spandan More, Hanjiang Wu, Irene Wang, Aaron Jezghani, Tushar Krishna, Divya Mahajan
12 Sep 2025

HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap
Wenxiang Lin, Xinglin Pan, Lin Zhang, Shaohuai Shi, Xuan Wang, Xiaowen Chu
13 Aug 2025

Morphlux: Transforming Torus Fabrics for Efficient Multi-tenant ML
Abhishek Vijaya Kumar, Eric Ding, Arjun Devraj, Darius Bunandar, Rachee Singh
20 Jul 2025

Efficient AllReduce with Stragglers
Arjun Devraj, Eric Ding, Abhishek Vijaya Kumar, Robert Kleinberg, Rachee Singh
29 May 2025

Towards Easy and Realistic Network Infrastructure Testing for Large-scale Machine Learning
Jinsun Yoo, ChonLam Lao, Lianjie Cao, Bob Lantz, Minlan Yu, Tushar Krishna, Puneet Sharma
29 Apr 2025

Routing for Large ML Models
Ofir Cohen, Jose Yallouz, Michael Schapira, Shahar Belkar, Tal Mizrahi
07 Mar 2025

MixNet: A Runtime Reconfigurable Optical-Electrical Fabric for Distributed Mixture-of-Experts Training
Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), 2025
Xudong Liao, Yijun Sun, Han Tian, Xinchen Wan, Yilun Jin, ..., Guyue Liu, Ying Zhang, Xiaofeng Ye, Yiming Zhang, Kai Chen
07 Jan 2025

LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs
Mo Sun, Zihan Yang, Changyue Liao, Yingtao Li, Leilei Gan, Zeke Wang
02 Sep 2024

Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, ..., Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Yang Liu
29 Jul 2024

VcLLM: Video Codecs are Secretly Tensor Codecs
Ceyu Xu, Yongji Wu, Xinyu Yang, Beidi Chen, Matthew Lentz, Danyang Zhuo, Lisa Wu Wills
29 Jun 2024

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey
Feng Liang, Zhen Zhang, Haifeng Lu, Victor C. M. Leung, Yanyi Guo, Xiping Hu
09 Apr 2024

MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms
Jiaang Duan, Shiyou Qian, Dingyu Yang, Hanwen Hu, Jian Cao, Guangtao Xue
03 Apr 2024

Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities
IEEE Network (IEEE Netw.), 2024
Yunze Wei, Tianshuo Hu, Cong Liang, Yong Cui
12 Mar 2024

MLTCP: Congestion Control for DNN Training
S. Rajasekaran, Sanjoli Narang, Anton A. Zabreyko, M. Ghobadi
14 Feb 2024

ForestColl: Throughput-Optimal Collective Communications on Heterogeneous Network Fabrics
Liangyu Zhao, Saeed Maleki, Ziyue Yang, Hossein Pourreza, Aashaka Shah, Changho Hwang, Arvind Krishnamurthy
09 Feb 2024

Swing: Short-cutting Rings for Higher Bandwidth Allreduce
Symposium on Networked Systems Design and Implementation (NSDI), 2024
Daniele De Sensi, Tommaso Bonato, D. Saam, Torsten Hoefler
17 Jan 2024

Credence: Augmenting Datacenter Switch Buffer Sharing with ML Predictions
Vamsi Addanki, Maciej Pacut, Stefan Schmid
05 Jan 2024

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
Fei Yang, Shuang Peng, Ning Sun, Fangyu Wang, Ke Tan, Fu Wu, Jiezhong Qiu, Aimin Pan
06 Dec 2023

MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems
International Symposium on Computer Architecture (ISCA), 2023
Samuel Hsia, Alicia Golden, Bilge Acun, Newsha Ardalani, Zach DeVito, Gu-Yeon Wei, David Brooks, Carole-Jean Wu
04 Oct 2023

Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies
IEEE International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2023
P. Basu, Liangyu Zhao, Jason Fantl, Siddharth Pal, Arvind Krishnamurthy, J. Khoury
24 Sep 2023

CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
Symposium on Networked Systems Design and Implementation (NSDI), 2023
S. Rajasekaran, M. Ghobadi, Aditya Akella
01 Aug 2023

TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning
International Symposium on Microarchitecture (MICRO), 2023
William Won, Suvinay Subramanian, Sudarshan Srinivasan, A. Durg, Samvit Kaul, Swati Gupta, Tushar Krishna
11 Apr 2023

TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings
International Symposium on Computer Architecture (ISCA), 2023
N. Jouppi, George Kurian, Sheng Li, Peter C. Ma, R. Nagarajan, ..., Brian Towles, C. Young, Xiaoping Zhou, Zongwei Zhou, David A. Patterson
04 Apr 2023

THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression
Symposium on Networked Systems Design and Implementation (NSDI), 2023
Minghao Li, Ran Ben-Basat, S. Vargaftik, Chon-In Lao, Ke Xu, Michael Mitzenmacher, Minlan Yu
16 Feb 2023

Efficient Direct-Connect Topologies for Collective Communications
Symposium on Networked Systems Design and Implementation (NSDI), 2022
Liangyu Zhao, Siddharth Pal, Tapan Chugh, Weiyang Wang, Jason Fantl, P. Basu, J. Khoury, Arvind Krishnamurthy
07 Feb 2022

LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2021
William Won, Saeed Rashidi, Sudarshan Srinivasan, T. Krishna
24 Sep 2021