Many extreme-scale scientific applications have workloads composed of a large number of individual high-performance tasks. The Pilot abstraction decouples workload specification, resource management, and task execution via job placeholders and late binding. As such, suitable implementations of the Pilot abstraction can support the collective execution of a large number of tasks on supercomputers. We introduce RADICAL-Pilot (RP) as a portable, modular, and extensible Pilot-enabled runtime system. We describe RP's design, architecture, and implementation. We characterize its performance and show its ability to scalably execute workloads composed of tens of thousands of heterogeneous tasks on DOE and NSF leadership-class HPC platforms. Specifically, we investigate RP's weak and strong scaling with CPU and GPU tasks, single-core and multi-core tasks, MPI and non-MPI tasks, and Python functions when using most of ORNL Summit and TACC Frontera. RADICAL-Pilot can be used stand-alone or as the runtime for third-party workflow systems.