TRANSOM: An Efficient Fault-Tolerant System for Training LLMs

16 October 2023
Baodong Wu
Lei Xia
Qingping Li
Kangyu Li
Xu Chen
Yongqiang Guo
Tieyao Xiang
Yuheng Chen
Shigang Li

Papers citing "TRANSOM: An Efficient Fault-Tolerant System for Training LLMs"

9 / 9 papers shown
BitSnap: Checkpoint Sparsification and Quantization in LLM Training
Yanxin Peng
Qingping Li
Baodong Wu
Shigang Li
Guohao Dai
Shengen Yan
Yu Wang
MQ
321
0
0
15 Nov 2025
xLLM Technical Report
T. Liu
Tao Peng
Peijun Yang
X. Zhao
Xiusheng Lu
...
Tong Yang
Hailong Yang
Jing-Jing Li
Guiguang Ding
Ke Zhang
142
2
0
16 Oct 2025
FlashRecovery: Fast and Low-Cost Recovery from Failures for Large-Scale Training of LLMs
H. Zhang
Jinxiang Wang
Zhenhua Yu
Y. Zhang
Xuejie Ji
...
Jun Yu Li
Hongwei Sun
Z. Zhang
Yi Xiong
Chunsheng Li
VLM
132
1
0
03 Sep 2025
MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
Weilin Cai
Le Qin
Jiayi Huang
MoE
188
0
0
08 Aug 2024
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
Jiangfei Duan
Shuo Zhang
Zerui Wang
Lijuan Jiang
Wenwen Qu
...
Dahua Lin
Yonggang Wen
Xin Jin
Tianwei Zhang
Yang Liu
363
31
0
29 Jul 2024
DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
Avinash Maurya
Robert Underwood
M. Rafique
Franck Cappello
Bogdan Nicolae
226
38
0
15 Jun 2024
Unicron: Economizing Self-Healing LLM Training at Scale
Tao He
Xue Li
Zhibin Wang
Kun Qian
Jingbo Xu
Wenyuan Yu
Jingren Zhou
215
26
0
30 Dec 2023
MRFI: An Open Source Multi-Resolution Fault Injection Framework for Neural Network Processing
Haitong Huang
Cheng Liu
Bo Liu
Xing-xiong Xue
Huawei Li
Xiaowei Li
142
2
0
20 Jun 2023
TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning
Micro (MICRO), 2023
William Won
Suvinay Subramanian
Sudarshan Srinivasan
A. Durg
Samvit Kaul
Swati Gupta
Tushar Krishna
272
21
0
11 Apr 2023