Hybrid AI-HPC workflows combine large-scale simulation, training, high-throughput inference, and tightly coupled, agent-driven control within a single execution campaign. These workflows impose heterogeneous and often conflicting requirements on runtime systems, spanning MPI executables, persistent AI services, fine-grained tasks, and low-latency AI-HPC coupling. Existing systems typically address only subsets of these requirements, limiting their ability to support emerging AI-HPC applications at scale. We present RHAPSODY, a multi-runtime middleware that enables concurrent execution of heterogeneous AI-HPC workloads through uniform abstractions for tasks, services, resources, and execution policies. Rather than replacing existing runtimes, RHAPSODY composes and coordinates them, allowing simulation codes, inference services, and agentic workflows to coexist within a single job allocation on leadership-class HPC platforms. We evaluate RHAPSODY with Dragon and vLLM on multiple HPC systems using representative heterogeneous, inference-at-scale, and tightly coupled AI-HPC workflows. Our results show that RHAPSODY introduces minimal runtime overhead, sustains increasing workload heterogeneity at scale, achieves near-linear scaling for high-throughput inference workloads, and enables data- and control-efficient coupling between AI and HPC tasks in agentic workflows.