Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2403.07648
Cited By
Characterization of Large Language Model Development in the Datacenter
12 March 2024
Qi Hu
Zhisheng Ye
Zerui Wang
Guoteng Wang
Mengdie Zhang
Qiaoling Chen
Peng Sun
Dahua Lin
Xiaolin Wang
Yingwei Luo
Yonggang Wen
Tianwei Zhang
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Characterization of Large Language Model Development in the Datacenter"
14 / 14 papers shown
Title
SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models
Hang Wu
Jianian Zhu
Y. Li
Haojie Wang
Biao Hou
Jidong Zhai
25
0
0
12 May 2025
Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training
Yijie Zheng
Bangjun Xiao
Lei Shi
Xiaoyang Li
Faming Wu
Tianyu Li
Xuefeng Xiao
Y. Zhang
Y. Wang
Shouda Liu
MLLM
MoE
64
1
0
31 Mar 2025
Green Prompting
Marta Adamska
Daria Smirnova
Hamid Nasiri
Zhengxin Yu
Peter Garraghan
81
0
0
09 Mar 2025
Minder: Faulty Machine Detection for Large-scale Distributed Model Training
Yangtao Deng
Xiang Shi
Zhuo Jiang
X. Zhang
Lei Zhang
...
Fuliang Li
Shuguang Wang
H. Lin
Jianxi Ye
Minlan Yu
LRM
76
2
0
04 Nov 2024
Revisiting Reliability in Large-Scale Machine Learning Research Clusters
Apostolos Kokolis
Michael Kuchnik
John Hoffman
Adithya Kumar
Parth Malani
Faye Ma
Zachary DeVito
S.
Kalyan Saladi
Carole-Jean Wu
68
7
0
29 Oct 2024
A Survey on Failure Analysis and Fault Injection in AI Systems
Guangba Yu
Gou Tan
Haojia Huang
Zhenyu Zhang
Pengfei Chen
Roberto Natella
Zibin Zheng
34
3
0
28 Jun 2024
SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation
Yifan Xiong
Yuting Jiang
Ziyue Yang
L. Qu
Guoshuai Zhao
...
Luke Melton
Joe Chau
Peng Cheng
Yongqiang Xiong
Lidong Zhou
47
6
0
09 Feb 2024
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang
Jason W. Wei
Dale Schuurmans
Quoc Le
Ed H. Chi
Sharan Narang
Aakanksha Chowdhery
Denny Zhou
ReLM
BDL
LRM
AI4CE
297
3,163
0
21 Mar 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
303
11,730
0
04 Mar 2022
Varuna: Scalable, Low-cost Training of Massive Deep Learning Models
Sanjith Athlur
Nitika Saran
Muthian Sivathanu
R. Ramjee
Nipun Kwatra
GNN
28
79
0
07 Nov 2021
Carbon Emissions and Large Neural Network Training
David A. Patterson
Joseph E. Gonzalez
Quoc V. Le
Chen Liang
Lluís-Miquel Munguía
D. Rothchild
David R. So
Maud Texier
J. Dean
AI4CE
239
626
0
21 Apr 2021
tf.data: A Machine Learning Data Processing Framework
D. Murray
Jiří Šimša
Ana Klimovic
Ihor Indyk
PINN
AI4CE
LMTD
39
86
0
28 Jan 2021
ZeRO-Offload: Democratizing Billion-Scale Model Training
Jie Ren
Samyam Rajbhandari
Reza Yazdani Aminabadi
Olatunji Ruwase
Shuangyang Yang
Minjia Zhang
Dong Li
Yuxiong He
MoE
160
399
0
18 Jan 2021
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi
M. Patwary
Raul Puri
P. LeGresley
Jared Casper
Bryan Catanzaro
MoE
243
1,791
0
17 Sep 2019
1