The demand for artificial intelligence has grown significantly over the last decade, and this growth has been fueled by advances in machine learning techniques and the ability to leverage hardware acceleration. However, to increase the quality of predictions and render machine learning solutions feasible for more complex applications, a substantial amount of training data is required. Although small machine learning models can be trained with modest amounts of data, the input for training larger models such as neural networks grows exponentially with the number of parameters. Since the demand for processing training data has outpaced the increase in computational power of single machines, there is a need to distribute the machine learning workload across multiple machines, turning a centralized system into a distributed one. These distributed systems present new challenges, first and foremost the efficient parallelization of the training process and the creation of a coherent model. This article provides an extensive overview of the current state of the art in the field by outlining the challenges and opportunities of distributed machine learning over conventional (centralized) machine learning, discussing the techniques used for distributed machine learning, and providing an overview of the systems that are available.