Federated Learning (FL) has gained increasing interest in recent years as a distributed on-device learning paradigm. However, multiple challenges remain to be addressed before FL can be deployed in real-world, hierarchical Internet-of-Things (IoT) networks. Although existing works have proposed various approaches to account for data heterogeneity, system heterogeneity, unexpected stragglers, and scalability, none of them provides a systematic solution that addresses all of these challenges in a hierarchical and unreliable IoT network. In this paper, we propose an asynchronous and hierarchical framework (Async-HFL) for performing FL in a common three-tier IoT network architecture. To cope with widely varying delays, Async-HFL employs asynchronous aggregation at both the gateway and the cloud levels and thus avoids long waiting times. To fully unleash the potential of Async-HFL in convergence speed under system heterogeneity and stragglers, we design device selection at the gateway level and device-gateway association at the cloud level. Device selection chooses edge devices to trigger local training in real time, while device-gateway association determines the network topology periodically after several cloud epochs; both operate under bandwidth limitations. We evaluate Async-HFL's convergence speedup using large-scale simulations based on ns-3 and a network topology from NYCMesh. Our results show that Async-HFL converges 1.08-1.31x faster in wall-clock time and saves up to 21.6% of total communication cost compared to state-of-the-art asynchronous FL algorithms (with client selection). We further validate Async-HFL on a physical deployment and observe robust convergence under unexpected stragglers.
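The idea of asynchronous aggregation at both the gateway and the cloud levels can be illustrated with a minimal sketch. The abstract does not state the aggregation rule, so everything below is an assumption made for illustration: the class `AsyncAggregator`, the helper `staleness_weight`, the mixing factor `base_alpha`, and the polynomial staleness discount are hypothetical and are not taken from Async-HFL itself. The sketch only shows the general pattern in which each tier applies an incoming update immediately, discounted by its staleness, rather than waiting for all senders.

```python
from dataclasses import dataclass
from typing import List


def mix(current: List[float], update: List[float], alpha: float) -> List[float]:
    """Move `current` toward `update` by a factor alpha in [0, 1]."""
    return [(1.0 - alpha) * c + alpha * u for c, u in zip(current, update)]


def staleness_weight(base_alpha: float, staleness: int) -> float:
    """Hypothetical discount (an assumption): older updates count less."""
    return base_alpha / (1.0 + staleness)


@dataclass
class AsyncAggregator:
    """Illustrative aggregator, reused for the gateway and the cloud tier."""
    model: List[float]
    base_alpha: float = 0.5
    version: int = 0

    def receive(self, update: List[float], trained_on_version: int) -> None:
        # Apply the update as soon as it arrives; no waiting for other senders.
        staleness = self.version - trained_on_version
        alpha = staleness_weight(self.base_alpha, staleness)
        self.model = mix(self.model, update, alpha)
        self.version += 1


cloud = AsyncAggregator(model=[0.0, 0.0])
gateway = AsyncAggregator(model=list(cloud.model))
gateway_base = cloud.version  # cloud version the gateway model started from

# Two devices started from gateway version 0 but report at different times,
# so the second update is one version stale when it arrives.
gateway.receive([1.0, 1.0], trained_on_version=0)
gateway.receive([2.0, 0.0], trained_on_version=0)

# The gateway forwards its model to the cloud without synchronizing with
# any other gateway.
cloud.receive(gateway.model, trained_on_version=gateway_base)
print(gateway.model, cloud.model)
```

In this toy setup the late device update is still used, only with a smaller weight, which is one common way asynchronous schemes tolerate stragglers. Device selection and device-gateway association, which the abstract describes as the main design contributions, are not modeled here.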