Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2011.02999
Cited By
CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery
5 November 2020
Kiwan Maeng
Shivam Bharuka
Isabel Gao
M. C. Jeffrey
V. Saraph
Bor-Yiing Su
Caroline Trippel
Jiyan Yang
Michael G. Rabbat
Brandon Lucia
Carole-Jean Wu
OffRL
Re-assign community
ArXiv
PDF
HTML
Papers citing
"CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery"
14 / 14 papers shown
Title
Lossless Compression for LLM Tensor Incremental Snapshots
Daniel Waddington
Cornel Constantinescu
9
0
0
14 May 2025
Revisiting Reliability in Large-Scale Machine Learning Research Clusters
Apostolos Kokolis
Michael Kuchnik
John Hoffman
Adithya Kumar
Parth Malani
Faye Ma
Zachary DeVito
Shri Kiran Srinivasan
Kalyan Saladi
Carole-Jean Wu
151
7
0
29 Oct 2024
Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
Yongji Wu
Wenjie Qu
Tianyang Tao
Zhuang Wang
Wei Bai
Zhuohao Li
Yuan Tian
Jiaheng Zhang
Matthew Lentz
Danyang Zhuo
66
3
0
05 Jul 2024
Characterization of Large Language Model Development in the Datacenter
Qi Hu
Zhisheng Ye
Zerui Wang
Guoteng Wang
Mengdie Zhang
...
Dahua Lin
Xiaolin Wang
Yingwei Luo
Yonggang Wen
Tianwei Zhang
56
43
0
12 Mar 2024
Reliable and Efficient In-Memory Fault Tolerance of Large Language Model Pretraining
Yuxin Wang
S. Shi
Xin He
Zhenheng Tang
Xinglin Pan
Yang Zheng
Xiaoyu Wu
Amelie Chi Zhou
Bingsheng He
Xiaowen Chu
KELM
32
4
0
19 Oct 2023
TRANSOM: An Efficient Fault-Tolerant System for Training LLMs
Baodong Wu
Lei Xia
Qingping Li
Kangyu Li
Xu Chen
Yongqiang Guo
Tieyao Xiang
Yuheng Chen
Shigang Li
34
11
0
16 Oct 2023
SWIFT: Expedited Failure Recovery for Large-scale DNN Training
Keon Jang
Hassan M. G. Wassel
Behnam Montazeri
Michael Ryan
David Wetherall
17
8
0
13 Feb 2023
RecShard: Statistical Feature-Based Memory Optimization for Industry-Scale Neural Recommendation
Geet Sethi
Bilge Acun
Niket Agarwal
Christos Kozyrakis
Caroline Trippel
Carole-Jean Wu
47
66
0
25 Jan 2022
Sustainable AI: Environmental Implications, Challenges and Opportunities
Carole-Jean Wu
Ramya Raghavendra
Udit Gupta
Bilge Acun
Newsha Ardalani
...
Maximilian Balandat
Joe Spisak
R. Jain
Michael G. Rabbat
K. Hazelwood
45
381
0
30 Oct 2021
Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
Mark Zhao
Niket Agarwal
Aarti Basant
B. Gedik
Satadru Pan
...
Kevin Wilfong
Harsha Rastogi
Carole-Jean Wu
Christos Kozyrakis
Parikshit Pol
GNN
26
70
0
20 Aug 2021
ECRM: Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding
Kaige Liu
J. Kosaian
K. V. Rashmi
22
4
0
05 Apr 2021
Understanding Training Efficiency of Deep Learning Recommendation Models at Scale
Bilge Acun
Matthew Murphy
Xiaodong Wang
Jade Nie
Carole-Jean Wu
K. Hazelwood
23
109
0
11 Nov 2020
Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems
Weijie Zhao
Deping Xie
Ronglai Jia
Yulei Qian
Rui Ding
Mingming Sun
P. Li
MoE
59
150
0
12 Mar 2020
RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing
Liu Ke
Udit Gupta
Carole-Jean Wu
B. Cho
Mark Hempstead
...
Dheevatsa Mudigere
Maxim Naumov
Martin D. Schatz
M. Smelyanskiy
Xiaodong Wang
46
213
0
30 Dec 2019
1