Executing scientific workflows with heterogeneous tasks on HPC platforms poses several challenges that upcoming exascale platforms will further exacerbate. At that scale, bespoke solutions will not enable effective and efficient workflow executions. In preparation, we need to find ways to manage engineering effort and capability duplication across software systems by integrating independently developed, production-grade software solutions. In this paper, we integrate RADICAL-Pilot (RP) and Parsl, and develop an MPI executor to enable the execution at scale of workflows with heterogeneous MPI and non-MPI Python functions. We characterize the strong and weak scaling of the integrated RP-Parsl system when executing two use cases from polar science, and of the function executor, on both SDSC Comet and TACC Frontera. We gain engineering insight into how to analyze and integrate workflow and runtime systems while minimizing changes to their code bases and the overall development effort. Our experiments show that the overheads of the integrated system are invariant with respect to resource and workflow scale, and they measure the impact of diverse MPI overheads. Together, these results define a blueprint for an ecosystem of specialized, efficient, effective, and independently maintained software systems to face the upcoming scaling challenges.