RHAPSODY: Execution of Hybrid AI-HPC Workflows at Scale

Aymen Alsaadi, Mason Hooten, Mariya Goliyad, Andre Merzky, Andrew Shao, Mikhail Titov, Tianle Wang, Yian Chen, Maria Kalantzi, Kent Lee, Andrew Park, Indira Pimpalkhare, Nick Radcliffe, Colin Wahl, Pete Mendygral, Matteo Turilli, Shantenu Jha

Hybrid AI-HPC workflows combine large-scale simulation, training, high-throughput inference, and tightly coupled, agent-driven control within a single execution campaign. These workflows impose heterogeneous and often conflicting requirements on runtime systems, spanning MPI executables, persistent AI services, fine-grained tasks, and low-latency AI-HPC coupling. Existing systems typically address only subsets of these requirements, limiting their ability to support emerging AI-HPC applications at scale. We present RHAPSODY, a multi-runtime middleware that enables concurrent execution of heterogeneous AI-HPC workloads through uniform abstractions for tasks, services, resources, and execution policies. Rather than replacing existing runtimes, RHAPSODY composes and coordinates them, allowing simulation codes, inference services, and agentic workflows to coexist within a single job allocation on leadership-class HPC platforms. We evaluate RHAPSODY with Dragon and vLLM on multiple HPC systems using representative heterogeneous, inference-at-scale, and tightly coupled AI-HPC workflows. Our results show that RHAPSODY introduces minimal runtime overhead, sustains increasing heterogeneity at scale, achieves near-linear scaling for high-throughput inference workloads, and provides data- and control-efficient coupling between AI and HPC tasks in agentic workflows.
