Task-based distributed frameworks (e.g., Ray, Dask, Hydro) have become increasingly popular for distributed applications that contain asynchronous and dynamic workloads, including asynchronous gradient descent, reinforcement learning, and model serving. As more data-intensive applications move to run on top of task-based systems, collective communication efficiency has become an important problem. Unfortunately, traditional collective communication libraries (e.g., MPI, Horovod, NCCL) are an ill fit, because they require the communication schedule to be known before runtime and they do not provide fault tolerance. We design and implement Hoplite, an efficient and fault-tolerant collective communication layer for task-based distributed systems. Our key technique is to compute data transfer schedules on the fly and execute the schedules efficiently through fine-grained pipelining. At the same time, when a task fails, the data transfer schedule adapts quickly to allow other tasks to keep making progress. We apply Hoplite to a popular task-based distributed framework, Ray. We show that Hoplite speeds up asynchronous stochastic gradient descent, reinforcement learning, and serving an ensemble of machine learning models that are difficult to execute efficiently with traditional collective communication by up to 7.8x, 3.9x, and 3.3x, respectively.
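To make the idea of fine-grained pipelining concrete, the sketch below shows one way a chunked broadcast can overlap receiving and forwarding: each intermediate node relays a chunk downstream as soon as it arrives, instead of waiting for the whole object. This is a minimal illustration under assumed names (`pipelined_broadcast`, `relay`, `CHUNK_SIZE` are invented here), not Hoplite's actual API or scheduling logic.

```python
# Illustrative sketch only (not Hoplite's implementation): chunk-level
# pipelining of a broadcast along a chain of receivers. Each node forwards a
# chunk as soon as it is received, so receive and re-send overlap.
import queue
import threading

CHUNK_SIZE = 4  # bytes per chunk; tiny on purpose, for illustration


def split_into_chunks(data: bytes):
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


def relay(inbox: queue.Queue, outbox: queue.Queue, received: list):
    """Forward each chunk downstream immediately; None marks end of object."""
    while True:
        chunk = inbox.get()
        if chunk is None:
            outbox.put(None)
            break
        received.append(chunk)
        outbox.put(chunk)  # forward before the remaining chunks have arrived


def pipelined_broadcast(data: bytes, num_receivers: int):
    # One queue per link in the chain: sender -> node 0 -> node 1 -> ...
    links = [queue.Queue() for _ in range(num_receivers + 1)]
    buffers = [[] for _ in range(num_receivers)]
    workers = [
        threading.Thread(target=relay, args=(links[i], links[i + 1], buffers[i]))
        for i in range(num_receivers)
    ]
    for w in workers:
        w.start()
    for chunk in split_into_chunks(data):  # sender streams chunks into the chain
        links[0].put(chunk)
    links[0].put(None)
    for w in workers:
        w.join()
    return [b"".join(buf) for buf in buffers]


if __name__ == "__main__":
    copies = pipelined_broadcast(b"hello, collective communication!", num_receivers=3)
    assert all(c == b"hello, collective communication!" for c in copies)
    print("all receivers got the full object via chunk-level pipelining")
```

In a real system the chain (or tree) over which chunks flow would be chosen at runtime from the current set of tasks, which is what allows the schedule to adapt when a task fails.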