Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training. Symposium on Operating Systems Principles (SOSP), 2025 |
MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems. International Symposium on Computer Architecture (ISCA), 2023 |
Proteus: Simulating the Performance of Distributed DNN Training. IEEE Transactions on Parallel and Distributed Systems (TPDS), 2023 |