Async-HFL: Efficient and Robust Asynchronous Federated Learning in Hierarchical IoT Networks

Federated Learning (FL) has gained increasing interest in recent years as a distributed on-device learning paradigm. However, multiple challenges remain to be addressed before FL can be deployed in real-world hierarchical Internet-of-Things (IoT) networks. Although existing works have proposed various approaches to account for data heterogeneity, system heterogeneity, unexpected stragglers, and scalability, none of them provides a systematic solution that addresses all of these challenges in a hierarchical and unreliable IoT network. In this paper, we propose an asynchronous and hierarchical framework (Async-HFL) for performing FL in a common three-tier IoT network architecture. In response to widely varying delays, Async-HFL employs asynchronous aggregation at both the gateway and the cloud levels, thereby avoiding long waiting times. To fully realize the potential of Async-HFL in convergence speed under system heterogeneity and stragglers, we design device selection at the gateway level and device-gateway association at the cloud level. Device selection chooses edge devices to trigger local training in real time, while device-gateway association determines the network topology periodically after several cloud epochs; both operate under bandwidth limitations. We evaluate Async-HFL's convergence speedup using large-scale simulations based on ns-3 and a network topology from NYCMesh. Our results show that Async-HFL converges 1.08-1.31x faster in wall-clock time and saves up to 21.6% of total communication cost compared to state-of-the-art asynchronous FL algorithms (with client selection). We further validate Async-HFL on a physical deployment and observe robust convergence under unexpected stragglers.
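To make the idea of asynchronous aggregation at two levels concrete, the following is a minimal sketch, not the update rule used by Async-HFL, which the abstract does not specify. It assumes a generic staleness-discounted mixing rule of the kind used in asynchronous FL, in which an aggregator blends each incoming model into its own as soon as the model arrives instead of waiting for all senders. The names `AsyncAggregator`, `base_weight`, and `staleness_weight` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


def mix(current: List[float], incoming: List[float], weight: float) -> List[float]:
    """Blend an incoming model into the current one with the given weight."""
    return [(1.0 - weight) * c + weight * n for c, n in zip(current, incoming)]


@dataclass
class AsyncAggregator:
    """Hypothetical aggregator usable at either the gateway or the cloud tier."""
    model: List[float]
    base_weight: float = 0.5  # assumed mixing weight for a fresh update
    version: int = 0          # number of updates merged so far

    def staleness_weight(self, update_version: int) -> float:
        # Assumption: updates computed from an older model version count less.
        staleness = self.version - update_version
        return self.base_weight / (1.0 + staleness)

    def receive(self, update: List[float], update_version: int) -> None:
        # Merge immediately on arrival; no waiting for slower senders.
        weight = self.staleness_weight(update_version)
        self.model = mix(self.model, update, weight)
        self.version += 1


# A gateway merges two device updates as they arrive, then reports to the cloud.
gateway = AsyncAggregator(model=[0.0, 0.0])
cloud = AsyncAggregator(model=[0.0, 0.0])
gateway.receive([1.0, 1.0], update_version=0)  # fresh update
gateway.receive([1.0, 1.0], update_version=0)  # straggler, one version stale
cloud.receive(gateway.model, update_version=0)
```

In this sketch the same class serves both tiers: devices send updates to a gateway instance, and gateways send their merged models to the cloud instance. Device selection and device-gateway association, which the abstract describes as separate components, are not modeled here.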
