2309.08125
Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
15 September 2023
Insu Jang
Zhenning Yang
Zhen Zhang
Xin Jin
Mosharaf Chowdhury
MoE
AI4CE
OODD
Papers citing "Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates"
25 / 25 papers shown
TrainVerify: Equivalence-Based Verification for Distributed LLM Training
Yunchi Lu
Youshan Miao
Cheng Tan
Peng Huang
Yi Zhu
Xian Zhang
Fan Yang
LRM
7
0
0
19 Jun 2025
All is Not Lost: LLM Recovery without Checkpoints
Nikolay Blagoev
Oğuzhan Ersoy
Lydia Yiyu Chen
24
0
0
18 Jun 2025
Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks
Yuxuan Jiang
Ziming Zhou
Boyu Xu
Beijie Liu
Runhui Xu
Peng Huang
12
0
0
06 Jun 2025
Learning in Chaos: Efficient Autoscaling and Self-healing for Distributed Training at the Edge
Wenjiao Feng
Rongxing Xiao
Zonghang Li
Hongfang Yu
Gang Sun
Long Luo
Mohsen Guizani
Qirong Ho
58
0
0
19 May 2025
Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training
Daiyaan Arfeen
Dheevatsa Mudigere
Ankit More
Bhargava Gopireddy
Ahmet Inci
G. R. Ganger
50
0
0
08 Apr 2025
Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training
Yijie Zheng
Bangjun Xiao
Lei Shi
Xiaoyang Li
Faming Wu
Tianyu Li
Xuefeng Xiao
Yanzhe Zhang
Yansen Wang
Shouda Liu
MLLM
MoE
139
1
0
31 Mar 2025
Stealing Training Data from Large Language Models in Decentralized Training through Activation Inversion Attack
Chenxi Dai
Lin Lu
Pan Zhou
99
0
0
22 Feb 2025
Orthogonal Calibration for Asynchronous Federated Learning
Jiayun Zhang
Shuheng Li
Haiyu Huang
Xiaofan Yu
Rajesh K. Gupta
Jingbo Shang
FedML
101
0
0
21 Feb 2025
Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization
Haoyang Li
Fangcheng Fu
Hao Ge
Sheng Lin
Xuanyu Wang
Jiawen Niu
Yijiao Wang
Hailin Zhang
Xiaonan Nie
Tengjiao Wang
MoMe
92
2
0
17 Oct 2024
FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training
Tianyuan Wu
Wei Wang
Yinghao Yu
Siran Yang
Wenchao Wu
Qinkai Duan
Guodong Yang
Jiamang Wang
Lin Qu
Liping Zhang
73
8
0
16 Oct 2024
Enhancing Robustness in Deep Reinforcement Learning: A Lyapunov Exponent Approach
Rory Young
Nicolas Pugeault
AAML
134
5
0
14 Oct 2024
HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng
Chi Zhang
Zilingfeng Ye
Xibin Wu
Wang Zhang
Ru Zhang
Size Zheng
Haibin Lin
Chuan Wu
AI4CE
203
240
0
28 Sep 2024
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
Jiangfei Duan
Shuo Zhang
Zerui Wang
Lijuan Jiang
Wenwen Qu
...
Dahua Lin
Yonggang Wen
Xin Jin
Tianwei Zhang
Peng Sun
143
12
0
29 Jul 2024
Enabling Elastic Model Serving with MultiWorld
Myungjin Lee
Akshay Jajoo
Ramana Rao Kompella
MoE
97
0
0
12 Jul 2024
Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
Yongji Wu
Wenjie Qu
Tianyang Tao
Zhuang Wang
Wei Bai
Zhuohao Li
Yuan Tian
Jiaheng Zhang
Matthew Lentz
Danyang Zhuo
94
3
0
05 Jul 2024
Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey
Feng Liang
Zhen Zhang
Haifeng Lu
Chengming Li
Victor C. M. Leung
Yanyi Guo
Xiping Hu
97
5
0
12 Jun 2024
SlipStream: Adapting Pipelines for Distributed Training of Large DNNs Amid Failures
Swapnil Gandhi
Mark Zhao
Athinagoras Skiadopoulos
Christos Kozyrakis
AI4CE
GNN
64
1
0
22 May 2024
Toward Cross-Layer Energy Optimizations in Machine Learning Systems
Jae-Won Chung
Mosharaf Chowdhury
65
0
0
10 Apr 2024
Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey
Feng Liang
Zhen Zhang
Haifeng Lu
Victor C. M. Leung
Yanyi Guo
Xiping Hu
GNN
103
8
0
09 Apr 2024
Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
Jiangfei Duan
Ziang Song
Xupeng Miao
Xiaoli Xi
Dahua Lin
Harry Xu
Minjia Zhang
Zhihao Jia
77
11
0
21 Mar 2024
Characterization of Large Language Model Development in the Datacenter
Qi Hu
Zhisheng Ye
Zerui Wang
Guoteng Wang
Mengdie Zhang
...
Dahua Lin
Xiaolin Wang
Yingwei Luo
Yonggang Wen
Tianwei Zhang
94
49
0
12 Mar 2024
Unicron: Economizing Self-Healing LLM Training at Scale
Tao He
Xue Li
Zhibin Wang
Kun Qian
Jingbo Xu
Wenyuan Yu
Jingren Zhou
57
15
0
30 Dec 2023
Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections
Marcel Wagenlander
Guo Li
Bo Zhao
Kai Zou
Peter R. Pietzuch
93
7
0
08 Dec 2023
Exploring the Robustness of Decentralized Training for Large Language Models
Lin Lu
Chenxi Dai
Wangcheng Tao
Binhang Yuan
Yanan Sun
Pan Zhou
77
1
0
01 Dec 2023
TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning
William Won
Suvinay Subramanian
Sudarshan Srinivasan
A. Durg
Samvit Kaul
Swati Gupta
Tushar Krishna
86
7
0
11 Apr 2023