Massive upgrades to science infrastructure are driving data velocities upward while stimulating the adoption of increasingly data-intensive analytics. While next-generation exascale supercomputers promise strong support for I/O-intensive workflows, HPC remains largely untapped by live experiments because data transfers and disparate batch-queueing policies are prohibitive when instrument time is scarce. To bridge this divide, we introduce Balsam: a distributed orchestration platform enabling workflows at the edge to securely and efficiently trigger analytics tasks across a user-managed federation of HPC execution sites. We describe the architecture of the Balsam service, which provides a workflow management API, and of the distributed sites, which provision resources and schedule scalable, fault-tolerant execution. We demonstrate Balsam by efficiently scaling real-time analytics from two DOE light sources simultaneously onto three supercomputers (Theta, Summit, and Cori), while maintaining low overheads for on-demand computing and providing a Python library for seamless integration with existing ecosystems of data analysis tools.