ResearchTrend.AI
Revisiting Reliability in Large-Scale Machine Learning Research Clusters


29 October 2024
Apostolos Kokolis
Michael Kuchnik
John Hoffman
Adithya Kumar
Parth Malani
Faye Ma
Zachary DeVito
S.
Kalyan Saladi
Carole-Jean Wu
arXiv: 2410.21680

Papers citing "Revisiting Reliability in Large-Scale Machine Learning Research Clusters"

2 / 2 papers shown
Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training
Daiyaan Arfeen
Dheevatsa Mudigere
Ankit More
Bhargava Gopireddy
Ahmet Inci
G. R. Ganger
08 Apr 2025
Characterizing GPU Resilience and Impact on AI/HPC Systems
Shengkun Cui
Archit Patke
Ziheng Chen
Aditya Ranjan
Hung Nguyen
...
Chandra Narayanaswami
Daby M. Sow
C. Martino
Zbigniew T. Kalbarczyk
R. Iyer
14 Mar 2025