ResearchTrend.AI
© 2026 ResearchTrend.AI, All rights reserved.

TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs

Symposium on Networked Systems Design and Implementation (NSDI), 2022
1 February 2022
Weiyang Wang, Moein Khazraee, Zhizhen Zhong, M. Ghobadi, Zhihao Jia, Dheevatsa Mudigere, Ying Zhang, A. Kewitsch

Papers citing "TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs"

27 citing papers:
Enabling Reconfiguration-Communication Overlap for Collective Communication in Optical Networks
Changbo Wu, Zhuolong Yu, Gongming Zhao, Hongli Xu
22 Oct 2025

Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective
Seokjin Go, Joongun Park, Spandan More, Hanjiang Wu, Irene Wang, Aaron Jezghani, Tushar Krishna, Divya Mahajan
12 Sep 2025

HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap
Wenxiang Lin, Xinglin Pan, Lin Zhang, Shaohuai Shi, Xuan Wang, Xiaowen Chu
13 Aug 2025

Morphlux: Transforming Torus Fabrics for Efficient Multi-tenant ML
Abhishek Vijaya Kumar, Eric Ding, Arjun Devraj, Darius Bunandar, Rachee Singh
20 Jul 2025

Efficient AllReduce with Stragglers
Arjun Devraj, Eric Ding, Abhishek Vijaya Kumar, Robert Kleinberg, Rachee Singh
29 May 2025

Towards Easy and Realistic Network Infrastructure Testing for Large-scale Machine Learning
Jinsun Yoo, ChonLam Lao, Lianjie Cao, Bob Lantz, Minlan Yu, Tushar Krishna, Puneet Sharma
29 Apr 2025

Routing for Large ML Models
Ofir Cohen, Jose Yallouz, Michael Schapira, Shahar Belkar, Tal Mizrahi
07 Mar 2025

MixNet: A Runtime Reconfigurable Optical-Electrical Fabric for Distributed Mixture-of-Experts Training
Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), 2025
Xudong Liao, Yijun Sun, Han Tian, Xinchen Wan, Yilun Jin, ..., Guyue Liu, Ying Zhang, Xiaofeng Ye, Yiming Zhang, Kai Chen
07 Jan 2025

LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs
Mo Sun, Zihan Yang, Changyue Liao, Yingtao Li, Leilei Gan, Zeke Wang
02 Sep 2024

Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, ..., Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Yang Liu
29 Jul 2024

VcLLM: Video Codecs are Secretly Tensor Codecs
Ceyu Xu, Yongji Wu, Xinyu Yang, Beidi Chen, Matthew Lentz, Danyang Zhuo, Lisa Wu Wills
29 Jun 2024

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey
Feng Liang, Zhen Zhang, Haifeng Lu, Victor C. M. Leung, Yanyi Guo, Xiping Hu
09 Apr 2024

MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms
Jiaang Duan, Shiyou Qian, Dingyu Yang, Hanwen Hu, Jian Cao, Guangtao Xue
03 Apr 2024

Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities
IEEE Network (IEEE Netw.), 2024
Yunze Wei, Tianshuo Hu, Cong Liang, Yong Cui
12 Mar 2024

MLTCP: Congestion Control for DNN Training
S. Rajasekaran, Sanjoli Narang, Anton A. Zabreyko, M. Ghobadi
14 Feb 2024

ForestColl: Throughput-Optimal Collective Communications on Heterogeneous Network Fabrics
Liangyu Zhao, Saeed Maleki, Ziyue Yang, Hossein Pourreza, Aashaka Shah, Changho Hwang, Arvind Krishnamurthy
09 Feb 2024

Swing: Short-cutting Rings for Higher Bandwidth Allreduce
Symposium on Networked Systems Design and Implementation (NSDI), 2024
Daniele De Sensi, Tommaso Bonato, D. Saam, Torsten Hoefler
17 Jan 2024

Credence: Augmenting Datacenter Switch Buffer Sharing with ML Predictions
Vamsi Addanki, Maciej Pacut, Stefan Schmid
05 Jan 2024

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
Fei Yang, Shuang Peng, Ning Sun, Fangyu Wang, Ke Tan, Fu Wu, Jiezhong Qiu, Aimin Pan
06 Dec 2023

MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems
International Symposium on Computer Architecture (ISCA), 2023
Samuel Hsia, Alicia Golden, Bilge Acun, Newsha Ardalani, Zach DeVito, Gu-Yeon Wei, David Brooks, Carole-Jean Wu
04 Oct 2023

Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies
IEEE International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2023
P. Basu, Liangyu Zhao, Jason Fantl, Siddharth Pal, Arvind Krishnamurthy, J. Khoury
24 Sep 2023

CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
Symposium on Networked Systems Design and Implementation (NSDI), 2023
S. Rajasekaran, M. Ghobadi, Aditya Akella
01 Aug 2023

TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning
International Symposium on Microarchitecture (MICRO), 2023
William Won, Suvinay Subramanian, Sudarshan Srinivasan, A. Durg, Samvit Kaul, Swati Gupta, Tushar Krishna
11 Apr 2023

TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings
International Symposium on Computer Architecture (ISCA), 2023
N. Jouppi, George Kurian, Sheng Li, Peter C. Ma, R. Nagarajan, ..., Brian Towles, C. Young, Xiaoping Zhou, Zongwei Zhou, David A. Patterson
04 Apr 2023

THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression
Symposium on Networked Systems Design and Implementation (NSDI), 2023
Minghao Li, Ran Ben-Basat, S. Vargaftik, Chon-In Lao, Ke Xu, Michael Mitzenmacher, Minlan Yu
16 Feb 2023

Efficient Direct-Connect Topologies for Collective Communications
Symposium on Networked Systems Design and Implementation (NSDI), 2022
Liangyu Zhao, Siddharth Pal, Tapan Chugh, Weiyang Wang, Jason Fantl, P. Basu, J. Khoury, Arvind Krishnamurthy
07 Feb 2022

LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2021
William Won, Saeed Rashidi, Sudarshan Srinivasan, T. Krishna
24 Sep 2021