Distributed machine learning refers to the practice of training a model across multiple machines, or nodes. Serverless computing is a cloud computing paradigm that uses functions as the unit of computation; it can benefit distributed learning systems through automated resource scaling, reduced manual intervention, and lower cost. By distributing the workload, distributed machine learning speeds up training and makes it feasible to train more complex models. Several topologies for distributed machine learning have been established (centralized, parameter server, peer-to-peer). However, the parameter server architecture may have limitations in fault tolerance, including a single point of failure and complex recovery processes. Training in a peer-to-peer (P2P) architecture can improve fault tolerance by eliminating that single point of failure: each node acts as both a server and a client, which enables more decentralized decision making and removes the need for a central coordinator. In this position paper, we propose exploring the use of serverless computing for distributed machine learning training and comparing the performance of the P2P architecture with the parameter server architecture, focusing on cost reduction and fault tolerance.
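The architectural contrast above can be illustrated with a minimal sketch; the function names and the single-scalar "gradients" are purely illustrative, not drawn from any particular framework:

```python
# Hypothetical sketch of one synchronous update step in each topology.
# Gradients and weights are scalars here for clarity; real systems
# exchange tensors over the network.

def parameter_server_step(gradients, weights, lr=0.1):
    """Centralized: one server averages all workers' gradients and
    updates the single shared weight copy (a single point of failure)."""
    avg = sum(gradients) / len(gradients)
    return weights - lr * avg

def p2p_step(gradients, local_weights, lr=0.1):
    """Decentralized: every peer receives all peers' gradients, averages
    them locally, and updates its own replica, so no node is central."""
    avg = sum(gradients) / len(gradients)
    return [w - lr * avg for w in local_weights]

# Two workers report gradients 1.0 and 3.0 from weights initialized at 10.0.
ps_weights = parameter_server_step([1.0, 3.0], 10.0)
p2p_weights = p2p_step([1.0, 3.0], [10.0, 10.0])
```

In the P2P case every peer performs the same averaging step, so the loss of any single node leaves the remaining replicas able to continue; in the parameter server case, losing the server halts training.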