Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs (26 April 2022)
John Thorpe, Pengzhan Zhao, Jon Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, Ravi Netravali, Guoqing Harry Xu

Papers citing "Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs" (26 papers)

All is Not Lost: LLM Recovery without Checkpoints (18 Jun 2025)
Nikolay Blagoev, Oğuzhan Ersoy, Lydia Yiyu Chen

Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks (06 Jun 2025)
Yuxuan Jiang, Ziming Zhou, Boyu Xu, Beijie Liu, Runhui Xu, Peng Huang

Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training (31 Mar 2025)
Yijie Zheng, Bangjun Xiao, Lei Shi, Xiaoyang Li, Faming Wu, Tianyu Li, Xuefeng Xiao, Yanzhe Zhang, Yansen Wang, Shouda Liu
Tags: MLLM, MoE

Stealing Training Data from Large Language Models in Decentralized Training through Activation Inversion Attack (22 Feb 2025)
Chenxi Dai, Lin Lu, Pan Zhou

Real-time and Downtime-tolerant Fault Diagnosis for Railway Turnout Machines (RTMs) Empowered with Cloud-Edge Pipeline Parallelism (04 Nov 2024)
Fan Wu, Muhammad Bilal, Haolong Xiang, Heng Wang, Jinjun Yu, Xiaolong Xu

SkyServe: Serving AI Models across Regions and Clouds with Spot Instances (03 Nov 2024)
Ziming Mao, Tian Xia, Zhanghao Wu, Wei-Lin Chiang, Tyler Griggs, Romil Bhardwaj, Zongheng Yang, S. Shenker, Ion Stoica

Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization (17 Oct 2024)
Haoyang Li, Fangcheng Fu, Hao Ge, Sheng Lin, Xuanyu Wang, Jiawen Niu, Yijiao Wang, Hailin Zhang, Xiaonan Nie, Tengjiao Wang
Tags: MoMe

FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training (16 Oct 2024)
Tianyuan Wu, Wei Wang, Yinghao Yu, Siran Yang, Wenchao Wu, Qinkai Duan, Guodong Yang, Jiamang Wang, Lin Qu, Liping Zhang

PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training (23 Sep 2024)
Daiyaan Arfeen, Zhen Zhang, Xinwei Fu, G. R. Ganger, Yida Wang
Tags: AI4CE

FreeRide: Harvesting Bubbles in Pipeline Parallelism (11 Sep 2024)
Jiashu Zhang, Zihan Pan, Molly Xu, Khuzaima S. Daudjee

Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation (07 Aug 2024)
Weiqi Feng, Yangrui Chen, Shaoyu Wang, Size Zheng, H. Lin, Minlan Yu
Tags: MLLM, AI4CE

Efficient Training of Large Language Models on Distributed Infrastructures: A Survey (29 Jul 2024)
Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, ..., Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun

Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement (05 Jul 2024)
Yongji Wu, Wenjie Qu, Tianyang Tao, Zhuang Wang, Wei Bai, Zhuohao Li, Yuan Tian, Jiaheng Zhang, Matthew Lentz, Danyang Zhuo

VcLLM: Video Codecs are Secretly Tensor Codecs (29 Jun 2024)
Ceyu Xu, Yongji Wu, Xinyu Yang, Beidi Chen, Matthew Lentz, Danyang Zhuo, Lisa Wu Wills

SlipStream: Adapting Pipelines for Distributed Training of Large DNNs Amid Failures (22 May 2024)
Swapnil Gandhi, Mark Zhao, Athinagoras Skiadopoulos, Christos Kozyrakis
Tags: AI4CE, GNN

Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity (22 Apr 2024)
Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica

Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances (21 Mar 2024)
Jiangfei Duan, Ziang Song, Xupeng Miao, Xiaoli Xi, Dahua Lin, Harry Xu, Minjia Zhang, Zhihao Jia

Characterization of Large Language Model Development in the Datacenter (12 Mar 2024)
Qi Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Mengdie Zhang, ..., Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang

Towards providing reliable job completion time predictions using PCS (18 Jan 2024)
Abdullah Bin Faisal, Noah Martin, Hafiz Mohsin Bashir, Swaminathan Lamelas, Fahad R. Dogar

Training and Serving System of Foundation Models: A Comprehensive Survey (05 Jan 2024)
Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuan-fu Zhang, Zibin Zheng
Tags: ALM

Unicron: Economizing Self-Healing LLM Training at Scale (30 Dec 2023)
Tao He, Xue Li, Zhibin Wang, Kun Qian, Jingbo Xu, Wenyuan Yu, Jingren Zhou

Exploring the Robustness of Decentralized Training for Large Language Models (01 Dec 2023)
Lin Lu, Chenxi Dai, Wangcheng Tao, Binhang Yuan, Yanan Sun, Pan Zhou

SpotServe: Serving Generative Large Language Models on Preemptible Instances (27 Nov 2023)
Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Tengjiao Wang, Zhihao Jia
Tags: VLM

Benchmarking and In-depth Performance Study of Large Language Models on Habana Gaudi Processors (29 Sep 2023)
Chengming Zhang, Baixi Sun, Xiaodong Yu, Zhen Xie, Weijian Zheng, K. Iskra, Pete Beckman, Dingwen Tao

Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates (15 Sep 2023)
Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, Mosharaf Chowdhury
Tags: MoE, AI4CE, OODD

SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient (27 Jan 2023)
Max Ryabinin, Tim Dettmers, Michael Diskin, Alexander Borzunov
Tags: MoE