TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs

Symposium on Networked Systems Design and Implementation (NSDI), 2022

1 February 2022

Papers citing "TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs"

27 / 27 papers shown

Enabling Reconfiguration-Communication Overlap for Collective Communication in Optical Networks

22 Oct 2025

Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective

12 Sep 2025

HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap

13 Aug 2025

Morphlux: Transforming Torus Fabrics for Efficient Multi-tenant ML

Abhishek Vijaya Kumar

20 Jul 2025

Efficient AllReduce with Stragglers

Arjun Devraj

Eric Ding

Abhishek Vijaya Kumar

Robert Kleinberg

Rachee Singh

29 May 2025

Towards Easy and Realistic Network Infrastructure Testing for Large-scale Machine Learning

29 Apr 2025

Routing for Large ML Models

Ofir Cohen

Jose Yallouz

Michael Schapira

Shahar Belkar

Tal Mizrahi

07 Mar 2025

MixNet: A Runtime Reconfigurable Optical-Electrical Fabric for Distributed Mixture-of-Experts Training

Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), 2025

...

07 Jan 2025

LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs

02 Sep 2024

Efficient Training of Large Language Models on Distributed Infrastructures: A Survey

...

Dahua Lin

Yonggang Wen

Xin Jin

Tianwei Zhang

Yang Liu

29 Jul 2024

VcLLM: Video Codecs are Secretly Tensor Codecs

Danyang Zhuo

29 Jun 2024

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

Xiping Hu

09 Apr 2024

MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms

03 Apr 2024

Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities

IEEE Network (IEEE Netw.), 2024

12 Mar 2024

MLTCP: Congestion Control for DNN Training

14 Feb 2024

ForestColl: Throughput-Optimal Collective Communications on Heterogeneous Network Fabrics

09 Feb 2024

Swing: Short-cutting Rings for Higher Bandwidth Allreduce

Symposium on Networked Systems Design and Implementation (NSDI), 2024

17 Jan 2024

Credence: Augmenting Datacenter Switch Buffer Sharing with ML Predictions

Vamsi Addanki

Maciej Pacut

Stefan Schmid

05 Jan 2024

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment

06 Dec 2023

MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems

International Symposium on Computer Architecture (ISCA), 2023

David Brooks

04 Oct 2023

Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies

IEEE International Symposium on High-Performance Parallel Distributed Computing (HPDC), 2023

24 Sep 2023

CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters

Symposium on Networked Systems Design and Implementation (NSDI), 2023

01 Aug 2023

TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning

Micro (MICRO), 2023

11 Apr 2023

TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings

International Symposium on Computer Architecture (ISCA), 2023

...

04 Apr 2023

THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression

Symposium on Networked Systems Design and Implementation (NSDI), 2023

Minlan Yu (Harvard University)

16 Feb 2023

Efficient Direct-Connect Topologies for Collective Communications

Symposium on Networked Systems Design and Implementation (NSDI), 2022

07 Feb 2022

LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models

IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2021

24 Sep 2021