Toward Efficient Online Scheduling for Distributed Machine Learning Systems

Recent years have witnessed a rapid growth of distributed machine learning (ML) frameworks, which exploit the massive parallelism of computing clusters to expedite ML training. However, the proliferation of distributed ML frameworks also introduces many unique technical challenges in computing system design and optimization. In a networked computing cluster that supports a large number of training jobs, a key question is how to design efficient scheduling algorithms to allocate workers and parameter servers across different machines to minimize the overall training time. Toward this end, in this paper, we develop an online scheduling algorithm that jointly optimizes resource allocation and locality decisions. Our main contributions are three-fold: i) We develop a new analytical model that considers both resource allocation and locality; ii) Based on an equivalent reformulation and observations on the worker-parameter server locality configurations, we transform the problem into a mixed packing and covering integer program, which enables approximation algorithm design; iii) We propose a meticulously designed approximation algorithm based on randomized rounding and rigorously analyze its performance. Collectively, our results contribute to the state of the art of distributed ML system optimization and algorithm design.
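To make the randomized-rounding idea in contribution iii) concrete, the following is a minimal sketch of generic randomized rounding applied to a tiny mixed packing and covering instance. It is not the algorithm of the paper, which the abstract does not spell out. The function names, the fractional solution, the per-machine capacities (packing rows), and the minimum worker count (covering row) are all hypothetical values chosen for illustration.

```python
import math
import random


def randomized_round(x_frac, rng):
    """Round each value up with probability equal to its fractional part."""
    rounded = []
    for x in x_frac:
        base = math.floor(x)
        rounded.append(base + (1 if rng.random() < x - base else 0))
    return rounded


def satisfies(x, packing, covering):
    """Check packing rows (a.x <= b) and covering rows (c.x >= d)."""
    def dot(row):
        return sum(a * v for a, v in zip(row, x))
    return (all(dot(a) <= b for a, b in packing)
            and all(dot(c) >= d for c, d in covering))


def round_until_feasible(x_frac, packing, covering, seed=0, max_tries=1000):
    """Repeat the rounding until every constraint holds, or give up."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        x = randomized_round(x_frac, rng)
        if satisfies(x, packing, covering):
            return x
    return None


# Hypothetical instance: workers placed on three machines.
x_frac = [1.5, 0.5, 2.0]                 # assumed fractional (LP-style) solution
packing = [([1, 0, 0], 2),               # assumed capacity of machine 0
           ([0, 1, 0], 1),               # assumed capacity of machine 1
           ([0, 0, 1], 2)]               # assumed capacity of machine 2
covering = [([1, 1, 1], 4)]              # assumed: the job needs at least 4 workers

allocation = round_until_feasible(x_frac, packing, covering)
print(allocation)
```

In this sketch each rounded entry equals its fractional value in expectation, and the retry loop simply discards draws that violate a packing or covering row. An approximation analysis such as the one the abstract describes would instead bound how likely and how large such violations are.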
