Federated Learning promises a new approach to the challenges of machine learning by bringing computation to the data. The popularity of the approach has led to rapid progress on the algorithmic side and to the emergence of systems capable of simulating Federated Learning. State-of-the-art Federated Learning systems support only a single-node aggregator, which is insufficient for training across a large corpus of devices or for training larger models. As the model size or the number of devices increases, the single-node aggregator incurs memory and computation burdens while performing fusion tasks. It also faces communication bottlenecks when a large number of model updates are sent to a single node. We classify the aggregator's workload into categories and propose a new aggregation service for handling each load. Our aggregation service is based on a holistic approach that chooses the best solution depending on the model update size and the number of clients. Our system provides a fault-tolerant, robust, and efficient aggregation solution built on existing parallel and distributed frameworks. Through evaluation, we show the shortcomings of state-of-the-art approaches and why a single solution is not suitable for all aggregation requirements. We also compare current frameworks with our system through extensive experiments.
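To make the holistic selection idea concrete, the following is a minimal sketch of how an aggregation service might fuse client updates and dispatch to a backend based on workload characteristics. The function names, backend labels, and thresholds are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def federated_average(updates, weights=None):
    """Weighted fusion of client model updates (standard FedAvg-style mean).

    `updates` is a list of equally shaped arrays, one per client; `weights`
    (e.g. local dataset sizes) default to a uniform average.
    """
    if weights is None:
        weights = [1.0] * len(updates)
    stacked = np.stack([np.asarray(u, dtype=float) for u in updates])
    w = np.asarray(weights, dtype=float)[:, None]
    return (stacked * w).sum(axis=0) / w.sum()

def choose_aggregator(update_bytes, num_clients,
                      size_threshold=50 * 2**20,   # 50 MiB, placeholder value
                      client_threshold=1000):       # placeholder value
    """Pick an aggregation backend from the workload category.

    Large model updates favor sharding tensors across workers; large client
    populations favor a hierarchy (tree) of partial aggregators; small
    workloads can stay on a single node.
    """
    if update_bytes > size_threshold:
        return "sharded"
    if num_clients > client_threshold:
        return "hierarchical"
    return "single-node"
```

For example, a small deployment (tens of clients, kilobyte-scale updates) would be routed to the single-node path, while the same service would switch to a sharded or hierarchical backend as update size or client count grows.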