Federated learning (FL) supports training models on geographically distributed devices. However, traditional FL systems adopt a centralized synchronous strategy, which imposes high communication pressure on the central server and challenges model generalization. Existing FL optimizations either fail to speed up training on heterogeneous devices or suffer from poor communication efficiency. In this paper, we propose HADFL, a framework that supports decentralized asynchronous training on heterogeneous devices. Devices train the model locally on their own data, with heterogeneity-aware numbers of local steps. In each aggregation cycle, devices are selected probabilistically to perform model synchronization and aggregation. Compared with traditional FL systems, HADFL relieves the central server's communication pressure, efficiently utilizes heterogeneous computing power, and achieves speedups of up to 3.15x over decentralized FedAvg and 4.68x over the PyTorch distributed training scheme, respectively, with almost no loss of convergence accuracy.
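To make the two mechanisms named above concrete, the following is a minimal sketch (not the paper's implementation) of heterogeneity-aware local steps and probabilistic selection of aggregating devices. The device speeds, the base_steps parameter, and the speed-proportional selection rule are illustrative assumptions, not details taken from HADFL.

```python
import random

def local_steps(speed, base_steps=10):
    # Assumption: faster devices run more local training steps per
    # aggregation cycle, so all devices finish a cycle in roughly
    # the same wall-clock time.
    return max(1, round(base_steps * speed))

def pick_aggregators(device_speeds, k=2):
    # Assumption: sample k devices with probability proportional to
    # their compute speed (with replacement, for simplicity) to perform
    # model synchronization and aggregation in this cycle.
    devices = list(device_speeds)
    weights = [device_speeds[d] for d in devices]
    return random.choices(devices, weights=weights, k=k)

speeds = {"gpu_node": 2.0, "laptop": 1.0, "phone": 0.3}
print({d: local_steps(s) for d, s in speeds.items()})  # e.g. {'gpu_node': 20, 'laptop': 10, 'phone': 3}
print(pick_aggregators(speeds))                         # e.g. ['gpu_node', 'laptop']
```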