2410.21680
Revisiting Reliability in Large-Scale Machine Learning Research Clusters
29 October 2024
Apostolos Kokolis, Michael Kuchnik, John Hoffman, Adithya Kumar, Parth Malani, Faye Ma, Zachary DeVito, S., Kalyan Saladi, Carole-Jean Wu
Papers citing "Revisiting Reliability in Large-Scale Machine Learning Research Clusters"
2 / 2 papers shown
Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training
Daiyaan Arfeen, Dheevatsa Mudigere, Ankit More, Bhargava Gopireddy, Ahmet Inci, G. R. Ganger
08 Apr 2025
Characterizing GPU Resilience and Impact on AI/HPC Systems
Shengkun Cui, Archit Patke, Ziheng Chen, Aditya Ranjan, Hung Nguyen, ..., Chandra Narayanaswami, Daby M. Sow, C. Martino, Zbigniew T. Kalbarczyk, R. Iyer
14 Mar 2025