Learning in Chaos: Efficient Autoscaling and Self-healing for Distributed Training at the Edge

19 May 2025
Wenjiao Feng, Rongxing Xiao, Zonghang Li, Hongfang Yu, Gang Sun, Long Luo, Mohsen Guizani, Qirong Ho
Abstract

Frequent node and link changes in edge AI clusters disrupt distributed training, while traditional checkpoint-based recovery and cloud-centric autoscaling are too slow for scale-out and ill-suited to chaotic, self-governed edge environments. This paper proposes Chaos, a resilient and scalable edge distributed training system with built-in self-healing and autoscaling. Chaos speeds up scale-out through multi-neighbor replication with fast shard scheduling: a new node pulls the latest training state from several nearby neighbors in parallel while the traffic load is balanced across them. A cluster monitor tracks resource and topology changes to inform scheduler decisions, and scaling events are handled through peer negotiation protocols, enabling fully self-governed autoscaling without a central admin. Extensive experiments show that Chaos consistently achieves much lower scale-out delays than Pollux, EDL, and Autoscaling, and handles scale-in, connect-link, and disconnect-link events within 1 millisecond, so node joins, exits, and failures are absorbed smoothly. It also delivers the lowest idle time, demonstrating superior resource use and scalability as the cluster grows.
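To make the multi-neighbor replication idea concrete, here is a minimal Python sketch. The paper's actual shard scheduler, wire protocol, and APIs are not given on this page, so the neighbor list, the fetch_shard() helper, and the greedy earliest-finish-time load-balancing rule below are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: a joining node splits the training state into
# shards and pulls them from multiple neighbors in parallel, assigning
# each shard to whichever neighbor would finish it soonest.
from concurrent.futures import ThreadPoolExecutor

def schedule_shards(shards, neighbors):
    """Greedy earliest-finish-time assignment (assumption: the real
    scheduler also accounts for topology and state freshness)."""
    finish_time = {n["id"]: 0.0 for n in neighbors}
    plan = {n["id"]: [] for n in neighbors}
    for shard in sorted(shards, key=lambda s: -s["size_mb"]):  # big first
        best = min(neighbors, key=lambda n:
                   finish_time[n["id"]] + shard["size_mb"] / n["bw_mbps"])
        plan[best["id"]].append(shard["name"])
        finish_time[best["id"]] += shard["size_mb"] / best["bw_mbps"]
    return plan

def fetch_shard(neighbor_id, shard_name):
    # Placeholder for the actual network pull of one training-state shard.
    return f"{shard_name} pulled from {neighbor_id}"

if __name__ == "__main__":
    shards = [{"name": f"shard{i}", "size_mb": s}
              for i, s in enumerate([512, 256, 256, 128])]
    neighbors = [{"id": "nodeA", "bw_mbps": 800},
                 {"id": "nodeB", "bw_mbps": 400}]
    plan = schedule_shards(shards, neighbors)
    with ThreadPoolExecutor() as pool:  # one parallel pull per shard
        futures = [pool.submit(fetch_shard, nid, s)
                   for nid, names in plan.items() for s in names]
        for f in futures:
            print(f.result())
```

The greedy rule keeps fast neighbors busier, which is one plausible way to realize the abstract's "balancing the traffic load between them."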

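The peer negotiation idea can be sketched similarly. The paper does not publish its protocol messages, so the LEAVE/ACK exchange and the round-robin takeover rule below are assumptions; the point illustrated is that every surviving peer can run the same deterministic function on a scale-in announcement and reach an identical shard reassignment with no central admin.

```python
# Hypothetical sketch of self-governed negotiation for a scale-in event:
# a departing node broadcasts LEAVE with the shards it holds, and each
# survivor independently computes the same takeover plan.

def negotiate_scale_in(leaving_node, orphaned_shards, peers):
    """Pure, deterministic reassignment: identical inputs give every
    peer an identical plan, so no coordinator is needed."""
    survivors = sorted(p for p in peers if p != leaving_node)
    plan = {p: [] for p in survivors}
    for i, shard in enumerate(sorted(orphaned_shards)):  # stable order
        plan[survivors[i % len(survivors)]].append(shard)
    return plan

if __name__ == "__main__":
    peers = ["nodeA", "nodeB", "nodeC"]
    # nodeC announces LEAVE, listing the shards it currently holds
    plan = negotiate_scale_in("nodeC", ["shard0", "shard3"], peers)
    for peer, taken in plan.items():
        print(f"{peer} ACKs and takes over {taken}")
```

Because the computation is local and message-free beyond the initial announcement, sub-millisecond handling of scale-in and link events, as the abstract reports, is at least plausible in this style of design.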
View on arXiv: https://arxiv.org/abs/2505.12815
@article{feng2025_2505.12815,
  title={Learning in Chaos: Efficient Autoscaling and Self-healing for Distributed Training at the Edge},
  author={Wenjiao Feng and Rongxing Xiao and Zonghang Li and Hongfang Yu and Gang Sun and Long Luo and Mohsen Guizani and Qirong Ho},
  journal={arXiv preprint arXiv:2505.12815},
  year={2025}
}