Efficient runtime task scheduling on complex memory hierarchies is becoming increasingly important as modern and future High-Performance Computing (HPC) systems are progressively composed of multi-socket and multi-chiplet nodes with non-uniform memory access latencies. Existing locality-aware scheduling schemes either require control of the data placement policy for memory-bound tasks or maximize locality for all classes of computations, resulting in a loss of potential performance. While such approaches are viable, an adaptive scheduling strategy is preferable because it can enhance locality and resource-sharing efficiency using a portable programming scheme. In this paper, we propose the Adaptive Resource-Moldable Scheduler (ARMS), which dynamically maps each task at runtime to a partition spanning one or more threads, based on the requirements of the task and of the DAG. The scheduler builds an online, platform-independent model of the local and non-local scheduling costs for each tuple consisting of task type (function) and task topology (task location within the DAG). We evaluate ARMS using task-parallel versions of SparseLU, 2D Stencil, FMM, and MatMul as examples. ARMS achieves up to a 3.5x performance gain over state-of-the-art locality-aware scheduling schemes.
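To make the idea of an online cost model keyed by (task type, task topology) more concrete, the following is a minimal sketch and not the ARMS implementation. It assumes a running mean of observed execution times as the cost estimate and a try-each-candidate-once exploration rule; the class name `OnlineCostModel`, the partition labels, and the timing values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class OnlineCostModel:
    """Running-mean cost per (task_type, topology) key and candidate partition.

    Illustrative only: the cost metric and exploration rule are assumptions,
    not the model described in the paper.
    """

    def __init__(self, partitions):
        # Hypothetical candidate placements, e.g. ("local", 1) or ("remote", 4),
        # read as (locality, number of threads in the partition).
        self.partitions = list(partitions)
        self.mean = defaultdict(dict)   # key -> {partition: mean observed cost}
        self.count = defaultdict(dict)  # key -> {partition: number of samples}

    def choose(self, task_type, topology):
        key = (task_type, topology)
        # Try every candidate once before trusting the model.
        for p in self.partitions:
            if p not in self.mean[key]:
                return p
        # Afterwards, pick the candidate with the lowest mean observed cost.
        return min(self.partitions, key=lambda p: self.mean[key][p])

    def record(self, task_type, topology, partition, elapsed):
        key = (task_type, topology)
        n = self.count[key].get(partition, 0) + 1
        prev = self.mean[key].get(partition, 0.0)
        # Incremental update of the running mean.
        self.mean[key][partition] = prev + (elapsed - prev) / n
        self.count[key][partition] = n


# Usage: two candidate partitions, one (task type, topology) tuple.
model = OnlineCostModel([("local", 1), ("remote", 1)])
first = model.choose("gemm", 0)            # untried, so the first candidate
model.record("gemm", 0, first, 4.0)
second = model.choose("gemm", 0)           # the remaining untried candidate
model.record("gemm", 0, second, 2.0)
best = model.choose("gemm", 0)             # lowest running mean so far
```

Because the model is keyed by the tuple rather than by the task type alone, the same function can settle on different partitions at different positions in the DAG, which is the property the abstract highlights.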