Classic Machine Learning techniques require training on data gathered in a single data lake. However, aggregating data from different owners is not always feasible, for reasons including security, privacy, and secrecy. Data carry a value that might vanish when shared with others; the ability to avoid sharing data enables industrial applications where security and privacy are of paramount importance, making it possible to train global models by implementing only local policies that can be run independently, even on air-gapped data centres. Federated Learning (FL) is a distributed machine learning approach that has emerged as an effective way to address privacy concerns: only local AI models are shared, while the data remain decentralized. Two critical challenges of Federated Learning are managing heterogeneous systems within the same federated network and dealing with real data, which are often not independently and identically distributed (non-IID) among the clients. In this paper, we focus on the second problem, i.e., the statistical heterogeneity of the data within the same federated network. In this setting, local models may stray far from the optimum of the complete dataset, thus possibly hindering the convergence of the federated model. Several Federated Learning algorithms aimed at tackling the non-IID setting, such as FedAvg, FedProx, and Federated Curvature (FedCurv), have already been proposed. This work provides an empirical assessment of the behaviour of FedAvg and FedCurv in common non-IID scenarios. Results show that the number of epochs per round is an important hyper-parameter that, when tuned appropriately, can lead to significant performance gains while reducing the communication cost. As a side product of this work, we release the non-IID versions of the datasets we used, so as to facilitate further comparisons within the FL community.
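Since the central finding concerns the number of local epochs per round, a compact illustration of where that hyper-parameter enters the FedAvg loop may help. The following Python sketch uses a toy least-squares objective, a covariate-shifted two-client split, and hypothetical helper names (`local_update`, `fedavg_round`, `E`); it is a minimal sketch of the FedAvg aggregation rule, not the paper's actual experimental code.

```python
import numpy as np

def local_update(w, data, E, lr=0.1):
    """Run E epochs of full-batch gradient descent on one client's data."""
    w = w.copy()
    X, y = data
    for _ in range(E):                       # E = local epochs per round
        grad = X.T @ (X @ w - y) / len(y)    # least-squares gradient (stand-in loss)
        w -= lr * grad
    return w

def fedavg_round(global_w, clients, E):
    """One communication round: each client trains locally, the server averages."""
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    local_ws = [local_update(global_w, c, E) for c in clients]
    # FedAvg aggregation: parameter average weighted by local dataset size.
    return sum((n / sizes.sum()) * w for n, w in zip(sizes, local_ws))

# Toy usage: two clients with covariate-shifted (non-IID) features.
rng = np.random.default_rng(0)
d, w_true = 5, np.ones(5)
clients = []
for shift in (-1.0, 1.0):                    # each client sees a different input region
    X = rng.normal(loc=shift, size=(100, d))
    clients.append((X, X @ w_true + 0.1 * rng.normal(size=100)))

w = np.zeros(d)
for _ in range(20):                          # 20 rounds x E=5 local epochs
    w = fedavg_round(w, clients, E=5)
print(np.round(w - w_true, 2))               # residual should be small
```

The trade-off studied in the paper lives in the two loop bounds above: raising `E` does more local work per round and so needs fewer communication rounds, but on non-IID clients a larger `E` lets each local model drift further from the global optimum before the next averaging step.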