Major bottlenecks of large-scale Federated Learning (FL) networks are the high communication and computation costs. This is because most current FL frameworks only consider a star network topology, in which all locally trained models are aggregated at a single server (e.g., a cloud server). This causes significant overhead at the server when the number of users is huge and the local models are large. This paper proposes a novel edge network architecture that decentralizes the model aggregation process at the server, thereby significantly reducing the aggregation latency of the whole network. In this architecture, we propose a highly effective in-network computation protocol consisting of two components. First, an in-network aggregation process is designed so that the majority of aggregation computations can be offloaded from the cloud server to edge nodes. Second, a joint routing and resource allocation optimization problem is formulated to minimize the aggregation latency of the whole system at every learning round. The problem turns out to be NP-hard, so we propose a polynomial-time routing algorithm that achieves near-optimal performance with a theoretical bound. Numerical results show that our proposed framework can dramatically reduce the network latency, by up to 4.6 times. Furthermore, compared with conventional baselines, this framework decreases the cloud's traffic and computing overhead by a factor of K/M, where K is the number of users and M is the number of edge nodes.
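To make the in-network aggregation idea concrete, the sketch below shows one hierarchical aggregation round in Python: K user models are first combined at M edge nodes according to a user-to-edge routing assignment, and the cloud then merges only the M partial aggregates. This is a minimal illustration assuming FedAvg-style weighted averaging; all function names (aggregate, hierarchical_round) and the round-robin assignment are hypothetical and not taken from the paper.

```python
# Minimal sketch of hierarchical (in-network) aggregation, assuming
# FedAvg-style weighted averaging. Names here are illustrative only.
import numpy as np

def aggregate(models, weights):
    """Weighted average of model parameter vectors (FedAvg-style)."""
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    return sum(w * m for w, m in zip(weights, models))

def hierarchical_round(user_models, user_sizes, assignment, M):
    """One round: K user models are first combined at M edge nodes,
    so the cloud only receives M partial aggregates instead of K."""
    edge_models, edge_sizes = [], []
    for e in range(M):
        members = [k for k, a in enumerate(assignment) if a == e]
        if not members:
            continue
        edge_models.append(aggregate([user_models[k] for k in members],
                                     [user_sizes[k] for k in members]))
        edge_sizes.append(sum(user_sizes[k] for k in members))
    # Cloud-side step: M uploads instead of K, i.e., K/M less traffic.
    return aggregate(edge_models, edge_sizes)

# Example: K = 6 users routed to M = 2 edge nodes (round-robin).
K, M, dim = 6, 2, 4
models = [np.random.randn(dim) for _ in range(K)]
sizes = [100] * K                        # local dataset sizes
assignment = [k % M for k in range(K)]   # hypothetical routing choice
global_model = hierarchical_round(models, sizes, assignment, M)
```

Because each edge aggregate is weighted by its group's total data size, the cloud-side result equals the flat weighted average over all K users, while the cloud's inbound traffic and aggregation work shrink by a factor of K/M (e.g., with K = 1000 users and M = 10 edge nodes, the cloud handles 10 uploads per round instead of 1000).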