2310.10046
Cited By
TRANSOM: An Efficient Fault-Tolerant System for Training LLMs
16 October 2023
Baodong Wu
Lei Xia
Qingping Li
Kangyu Li
Xu Chen
Yongqiang Guo
Tieyao Xiang
Yuheng Chen
Shigang Li
Papers citing
"TRANSOM: An Efficient Fault-Tolerant System for Training LLMs"
9 / 9 papers shown
BitSnap: Checkpoint Sparsification and Quantization in LLM Training
Yanxin Peng
Qingping Li
Baodong Wu
Shigang Li
Guohao Dai
Shengen Yan
Yu Wang
MQ
321
0
0
15 Nov 2025
xLLM Technical Report
T. Liu
Tao Peng
Peijun Yang
X. Zhao
Xiusheng Lu
...
Tong Yang
Hailong Yang
Jing-Jing Li
Guiguang Ding
Ke Zhang
142
2
0
16 Oct 2025
FlashRecovery: Fast and Low-Cost Recovery from Failures for Large-Scale Training of LLMs
H. Zhang
Jinxiang Wang
Zhenhua Yu
Y. Zhang
Xuejie Ji
...
Jun Yu Li
Hongwei Sun
Z. Zhang
Yi Xiong
Chunsheng Li
VLM
132
1
0
03 Sep 2025
MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
Weilin Cai
Le Qin
Jiayi Huang
MoE
188
0
0
08 Aug 2024
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
Jiangfei Duan
Shuo Zhang
Zerui Wang
Lijuan Jiang
Wenwen Qu
...
Dahua Lin
Yonggang Wen
Xin Jin
Tianwei Zhang
Yang Liu
363
31
0
29 Jul 2024
DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
Avinash Maurya
Robert Underwood
M. Rafique
Franck Cappello
Bogdan Nicolae
226
38
0
15 Jun 2024
Unicron: Economizing Self-Healing LLM Training at Scale
Tao He
Xue Li
Zhibin Wang
Kun Qian
Jingbo Xu
Wenyuan Yu
Jingren Zhou
215
26
0
30 Dec 2023
MRFI: An Open Source Multi-Resolution Fault Injection Framework for Neural Network Processing
Haitong Huang
Cheng Liu
Bo Liu
Xing-xiong Xue
Huawei Li
Xiaowei Li
142
2
0
20 Jun 2023
TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning
Micro (MICRO), 2023
William Won
Suvinay Subramanian
Sudarshan Srinivasan
A. Durg
Samvit Kaul
Swati Gupta
Tushar Krishna
272
21
0
11 Apr 2023