Federated Learning (FL) has gained increasing interest in recent years as a distributed on-device learning paradigm. However, multiple challenges remain to be addressed before FL can be deployed in real-world, hierarchical Internet-of-Things (IoT) networks. Although existing works have proposed various approaches to account for data heterogeneity, system heterogeneity, unexpected stragglers, and scalability, none of them provides a systematic solution that addresses all of these challenges in a hierarchical and unreliable IoT network. In this paper, we propose an asynchronous and hierarchical framework (Async-HFL) for performing FL in a common three-tier IoT network architecture. In response to widely varying delays, Async-HFL employs asynchronous aggregation at both the gateway and the cloud levels and thus avoids long waiting times. To fully unleash the potential of Async-HFL in convergence speed under system heterogeneity and stragglers, we design device selection at the gateway level and device-gateway association at the cloud level. Device selection chooses edge devices to trigger local training in real time, while device-gateway association determines the network topology periodically after several cloud epochs; both operate under bandwidth limitations. We evaluate Async-HFL's convergence speedup using large-scale simulations based on ns-3 and a network topology from NYCMesh. Our results show that Async-HFL converges 1.08-1.31x faster in wall-clock time and saves up to 21.6% of total communication cost compared to state-of-the-art asynchronous FL algorithms (with client selection). We further validate Async-HFL on a physical deployment and observe robust convergence under unexpected stragglers.
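The idea of asynchronous aggregation at both the gateway and the cloud levels can be illustrated with a minimal sketch. This is not the Async-HFL algorithm itself: the abstract does not specify the update rule, so the staleness-discounted mixing below, the names `AsyncAggregator` and `staleness_weight`, and the parameter `alpha` are all assumptions introduced for illustration. The point it shows is only that each tier merges an incoming model as soon as it arrives, rather than waiting for every sender.

```python
from dataclasses import dataclass
from typing import List


def staleness_weight(alpha: float, staleness: int) -> float:
    """Hypothetical discount: updates computed on older models count less."""
    return alpha / (1.0 + staleness)


@dataclass
class AsyncAggregator:
    """One tier (gateway or cloud) that merges each update on arrival.

    Assumed rule: w <- (1 - a) * w + a * update, where a shrinks with staleness.
    """
    weights: List[float]
    alpha: float = 0.5   # assumed base mixing rate, not taken from the paper
    version: int = 0     # incremented once per merged update

    def merge(self, update: List[float], base_version: int) -> None:
        # Staleness = how many merges happened since the sender fetched the model.
        staleness = self.version - base_version
        a = staleness_weight(self.alpha, staleness)
        self.weights = [(1 - a) * w + a * u for w, u in zip(self.weights, update)]
        self.version += 1


if __name__ == "__main__":
    # Gateway tier: two devices report at different times, no barrier between them.
    gateway = AsyncAggregator(weights=[0.0, 0.0])
    gateway.merge([1.0, 2.0], base_version=0)  # fresh update
    gateway.merge([1.0, 2.0], base_version=0)  # straggler, one version stale

    # Cloud tier: merges the gateway model as soon as it is uploaded.
    cloud = AsyncAggregator(weights=[0.0, 0.0])
    cloud.merge(gateway.weights, base_version=0)
    print(gateway.weights, cloud.weights)
```

Using the same class for both tiers reflects the three-tier layout described above (devices, gateways, cloud); device selection and device-gateway association are deliberately left out of this sketch.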