Scientists increasingly rely on Python tools to perform scalable distributed memory array operations using rich, NumPy-like expressions. However, many of these tools rely on dynamic schedulers optimized for abstract task graphs, which often encounter memory and network bandwidth bottlenecks due to sub-optimal data and operator placement decisions. Tools built on the Message Passing Interface (MPI), such as ScaLAPACK and SLATE, have better scaling properties, but these solutions require specialized knowledge to use. In this work, we present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems. This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS). LSHS is a local search method that optimizes operator placement by minimizing the maximum memory and network load on any given node within a distributed system. Coupled with a heuristic for load-balanced data layouts, our approach is capable of attaining communication lower bounds on some common numerical operations, and our empirical study shows that LSHS enhances performance on Ray by decreasing network load by 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem. On terabyte-scale data, NumS achieves performance competitive with SLATE on DGEMM, up to a 20x speedup over Dask on a key operation for tensor factorization, and a 2x speedup on logistic regression compared to Dask ML and Spark's MLlib.
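The core idea behind LSHS, as described above, is a min-max objective: place each operator on the node that keeps the worst-off node's memory and network load as low as possible. The following is a minimal, simplified sketch of that placement rule; the function and cost model here are hypothetical illustrations, not NumS's actual API, which also simulates data locality and hierarchical device placement.

```python
# Hypothetical sketch of min-max operator placement in the spirit of LSHS.
# Each node is tracked as a (memory_load, network_load) pair; a task is
# placed on whichever node minimizes the maximum load across the cluster.

def place_task(task_mem, task_net, nodes):
    """Return the index of the node that minimizes the maximum
    memory/network load on any node after placing the task."""
    best, best_cost = None, float("inf")
    for i, (mem, net) in enumerate(nodes):
        trial = list(nodes)
        trial[i] = (mem + task_mem, net + task_net)
        # Objective: the largest memory or network load on any single node.
        cost = max(max(m, n) for m, n in trial)
        if cost < best_cost:
            best, best_cost = i, cost
    return best

# Example: greedily place three tasks on a two-node cluster.
nodes = [(0.0, 0.0), (0.0, 0.0)]
tasks = [(4.0, 1.0), (2.0, 1.0), (3.0, 2.0)]  # (memory, network) costs
for mem, net in tasks:
    i = place_task(mem, net, nodes)
    nodes[i] = (nodes[i][0] + mem, nodes[i][1] + net)

print(nodes)  # loads end up balanced across the two nodes
```

This greedy load-simulation step is only one move of a local search; the actual scheduler evaluates placements against simulated loads over the whole task graph rather than one task at a time.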