
Scalable Runtime Architecture for Data-driven, Hybrid HPC and ML Workflow Applications

Abstract

Hybrid workflows combining traditional HPC and novel ML methodologies are transforming scientific computing. This paper presents the architecture and implementation of a scalable runtime system that extends RADICAL-Pilot with service-based execution to support AI-out-HPC workflows. Our runtime system enables distributed ML capabilities, efficient resource management, and seamless HPC/ML coupling across local and remote platforms. Preliminary experimental results show that our approach manages concurrent execution of ML models across local and remote HPC/cloud resources with minimal architectural overheads. This lays the foundation for prototyping three representative data-driven workflow applications and executing them at scale on leadership-class HPC platforms.

@article{merzky2025_2503.13343,
  title={Scalable Runtime Architecture for Data-driven, Hybrid HPC and ML Workflow Applications},
  author={Andre Merzky and Mikhail Titov and Matteo Turilli and Ozgur Kilic and Tianle Wang and Shantenu Jha},
  journal={arXiv preprint arXiv:2503.13343},
  year={2025}
}