Most conventional Federated Learning (FL) systems use a star network topology in which all users aggregate their local models at a single server (e.g., a cloud server). This incurs significant communication and computing overhead at the server and delays the training process, especially for large-scale FL systems with straggling nodes. This paper proposes a novel edge network architecture that decentralizes the model aggregation process away from the server, thereby significantly reducing the training delay for the whole FL network. Specifically, we design a highly effective in-network computation (INC) protocol consisting of a user scheduling mechanism, an in-network aggregation (INA) process designed for both primal and primal-dual methods in distributed machine learning, and a network routing algorithm. Under the proposed INA, we then formulate a joint routing and resource optimization problem that aims to minimize the aggregation latency. Since this problem is NP-hard, we propose a polynomial-time routing algorithm that achieves near-optimal performance with a theoretical bound. Simulation results show that the proposed INC framework not only reduces FL training latency by up to 5.6 times but also significantly decreases the cloud's traffic and computing overhead, thereby enabling large-scale FL.
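To make the aggregation idea concrete, the sketch below illustrates in Python how hierarchical in-network aggregation cuts the traffic arriving at the cloud: each edge node forwards a single partial aggregate instead of every user's update. This is a minimal illustration under simplifying assumptions (weighted FedAvg-style averaging, hypothetical helper names `edge_aggregate` and `cloud_aggregate`), not the paper's INC protocol, which additionally covers user scheduling, routing, and primal-dual updates.

```python
# Minimal illustrative sketch of hierarchical in-network aggregation.
# Hypothetical helpers; the paper's INC/INA protocol is more general.
import numpy as np

def edge_aggregate(user_updates, user_weights):
    """Partially aggregate user model updates at an edge node.

    Returns the weighted sum of updates and the total weight, so the
    cloud can finish the global average without seeing individual users.
    """
    weighted_sum = sum(w * u for w, u in zip(user_weights, user_updates))
    return weighted_sum, sum(user_weights)

def cloud_aggregate(edge_partials):
    """Combine edge-level partial aggregates into the global update."""
    total_weight = sum(w for _, w in edge_partials)
    return sum(s for s, _ in edge_partials) / total_weight

# Example: 2 edge nodes serving 3 and 5 users with 4-parameter models.
rng = np.random.default_rng(0)
edges = []
for n_users in (3, 5):
    updates = [rng.normal(size=4) for _ in range(n_users)]
    weights = [1.0] * n_users  # e.g., proportional to local dataset size
    edges.append(edge_aggregate(updates, weights))

global_update = cloud_aggregate(edges)  # cloud receives 2 messages, not 8
print(global_update)
```

In this toy setup the cloud handles one message per edge node rather than one per user, which is the source of the traffic and computing savings the abstract refers to; the latency reduction additionally depends on the scheduling and routing components, which this sketch omits.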