eACGM: Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems
We present eACGM, a full-stack AI/ML system monitoring framework based on eBPF. eACGM collects real-time performance data from key hardware components, including the GPU and network communication layer, as well as from key software stacks such as CUDA, Python, and PyTorch, all without requiring any code instrumentation or modifications. Additionally, it leverages libnvml to gather process-level GPU resource usage information. By applying a Gaussian Mixture Model (GMM) to the collected multidimensional performance metrics for statistical modeling and clustering analysis, eACGM effectively identifies complex failure modes, such as latency anomalies, hardware failures, and communication inefficiencies, enabling rapid diagnosis of system bottlenecks and abnormal behaviors.
View on arXiv