TimelyFL: Heterogeneity-aware Asynchronous Federated Learning with Adaptive Partial Training

In cross-device federated learning (FL), synchronous FL methods are hard to scale because stragglers hold up every training round. In addition, each client's availability to join training varies widely over time because of system heterogeneity and intermittent connectivity. Recent asynchronous FL methods (e.g., FedBuff) address these issues by letting slower clients continue local training on stale models and contribute to aggregation when they are ready. However, we show empirically that this approach can cause a substantial drop in training accuracy and a slower convergence rate. The primary reason is that fast devices contribute to many more aggregation rounds, while the others join intermittently or not at all, and then only with stale model updates. To overcome this barrier, we propose TimelyFL, a heterogeneity-aware asynchronous FL framework with adaptive partial training. During training, TimelyFL adjusts each client's local training workload according to its real-time resource capabilities, so that more of the available clients can join each global update without staleness. We demonstrate the performance benefits of TimelyFL through extensive experiments on various datasets (e.g., CIFAR-10, Google Speech, and Reddit) and models (e.g., ResNet20, VGG11, and ALBERT). Compared with the state of the art (i.e., FedBuff), TimelyFL improves the participation rate by 21.13%, converges 1.28x to 2.89x more efficiently, and increases test accuracy by 6.25%.
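
The abstract describes adaptive partial training only at a high level. The sketch below illustrates the general idea under our own assumptions; it is not TimelyFL's published algorithm. It assumes the server sets one round deadline equal to the full-model epoch time of the k-th fastest client. Fast clients fill that window with extra local epochs. Slow clients run one epoch but train only an output-side fraction of the layers, sized so they still finish in time. The cost model (`FORWARD_SHARE`), the cutoff `MIN_RATIO`, and the names `Plan`, `round_deadline`, and `plan_workloads` are all hypothetical.

```python
import math
from dataclasses import dataclass

# Hypothetical cost model (our assumption, not from the paper): the forward pass
# is about one third of a full training step, so training only the output-side
# fraction r of the layers costs roughly FORWARD_SHARE + (1 - FORWARD_SHARE) * r
# of a full-model epoch.
FORWARD_SHARE = 1.0 / 3.0
MIN_RATIO = 0.1  # hypothetical floor: below this, the client skips the round


@dataclass
class Plan:
    client_id: str
    epochs: int   # local epochs to run this round
    ratio: float  # fraction of layers trained (output side); 0.0 means skip


def round_deadline(epoch_times: dict[str, float], k: int) -> float:
    """Hypothetical rule: the deadline is the k-th fastest full-model epoch time."""
    return sorted(epoch_times.values())[k - 1]


def plan_workloads(epoch_times: dict[str, float], k: int) -> list[Plan]:
    """Fit every client's workload into one shared round deadline."""
    deadline = round_deadline(epoch_times, k)
    plans = []
    for cid, t_full in epoch_times.items():
        if t_full <= deadline:
            # Fast client: full model, as many whole epochs as fit.
            plans.append(Plan(cid, max(1, math.floor(deadline / t_full)), 1.0))
            continue
        # Slow client: one epoch, shrinking the trained fraction until it fits.
        budget = deadline / t_full  # share of a full epoch that fits, < 1 here
        ratio = (budget - FORWARD_SHARE) / (1.0 - FORWARD_SHARE)
        plans.append(Plan(cid, 1, ratio if ratio >= MIN_RATIO else 0.0))
    return plans


times = {"a": 10.0, "b": 20.0, "c": 30.0, "d": 60.0, "e": 200.0}
for plan in plan_workloads(times, k=2):
    print(plan)
```

In this example, k=2 gives a deadline of 20 s. Only clients a and b can finish a full-model epoch in that window, and a fits two epochs. Client c still joins by training roughly half of its layers, so three of the five clients contribute fresh updates instead of two. Clients d and e fall below MIN_RATIO and sit the round out.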

Related content

Federated Learning

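Federated learning is an emerging foundational AI technology. Google first proposed it in 2016, originally to let Android phone users update models locally on their devices. Its design goal is efficient machine learning across multiple participants or computing nodes, while keeping data exchange secure, protecting device and personal data privacy, and complying with laws and regulations. Federated learning is not limited to neural networks: it can also use other important machine learning algorithms, such as random forests. Federated learning is expected to become the foundation for the next generation of collaborative AI algorithms and collaborative networks.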
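
The paragraph above gives the goal of federated learning but does not show how the server combines client work. As a minimal sketch, the classic FedAvg rule (McMahan et al.) averages the clients' locally trained parameters and weights each client by its number of local samples. Only parameters leave the clients, never raw data. For simplicity, parameters are flattened into plain Python lists. `fedavg` is a hypothetical helper, not a library API.

```python
def fedavg(client_params: list[list[float]], num_samples: list[int]) -> list[float]:
    """FedAvg-style aggregation: sample-weighted mean of client parameter vectors."""
    if not client_params or len(client_params) != len(num_samples):
        raise ValueError("need at least one client and one sample count per client")
    dim = len(client_params[0])
    if any(len(p) != dim for p in client_params):
        raise ValueError("all clients must send parameter vectors of equal length")
    total = sum(num_samples)
    global_params = [0.0] * dim
    for params, n in zip(client_params, num_samples):
        weight = n / total  # this client's share of all training samples
        for i, value in enumerate(params):
            global_params[i] += weight * value
    return global_params


print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))  # [2.5, 3.5]
```

In the example, the second client holds three quarters of the samples, so the global parameters land three quarters of the way toward its values.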

PhD thesis "Federated Learning Simulator", 221 pages, Politecnico di Milano

Distributed ML systems for large-scale models: dynamic distributed training and scalable federated learning

[CVPR 2022] Differentially Private Federated Learning with Local Regularization and Sparsification

[SIGIR 2021] ScaleFreeCTR: a distributed training system for recommendation models with ultra-large-scale embeddings