Federated Learning (FL) is a promising distributed machine learning paradigm that enables collaborative training without compromising data privacy, and it is increasingly used in Artificial Intelligence of Things (AIoT) design. However, because existing FL methods lack efficient management of straggling devices, they suffer from low inference accuracy and long training time. The problem is aggravated by the uncertain factors present in AIoT scenarios (e.g., network delays and performance variances caused by process variation). To address this issue, this paper proposes a novel asynchronous FL framework named GitFL, whose implementation is inspired by the version control system Git. Unlike traditional FL, the cloud server of GitFL maintains a master model (i.e., the global model) together with a set of branch models that represent the trained local models committed by selected devices. The master model is updated based on all the pushed branch models and their version information, and only branch models that have undergone the pull operation are dispatched to devices. With our proposed Reinforcement Learning (RL)-based device selection mechanism, a pulled branch model with an older version is more likely to be dispatched to a faster and less frequently selected device for the next round of local training. In this way, GitFL enables both effective control of model staleness and adaptive load balancing of versioned models among straggling devices, thus avoiding performance deterioration. Comprehensive experimental results on well-known models and datasets show that, compared with state-of-the-art asynchronous FL methods, GitFL achieves up to 2.64× training acceleration and 7.88% inference accuracy improvement in various uncertain scenarios.
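The master/branch bookkeeping described in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the names `Branch`, `Device`, `merge_master`, and `pick_device` are hypothetical, the version-proportional weighting is one plausible reading of "updated based on both all the pushed branch models and their version information", and the greedy score in `pick_device` is a simple stand-in for the RL-based selection policy, not the policy itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Branch:
    weights: List[float]  # flattened model parameters (toy representation)
    version: int          # assumed: number of local training rounds completed


@dataclass
class Device:
    name: str
    avg_round_time: float  # observed seconds per local round; lower is faster
    times_selected: int    # how often this device has been chosen so far


def merge_master(branches: List[Branch]) -> List[float]:
    """Assumed rule: average branch weights in proportion to their versions,
    so more up-to-date branches contribute more to the master model."""
    total = sum(b.version for b in branches)
    dim = len(branches[0].weights)
    if total == 0:
        # No branch has trained yet: fall back to a uniform average.
        return [sum(b.weights[i] for b in branches) / len(branches)
                for i in range(dim)]
    return [sum(b.version * b.weights[i] for b in branches) / total
            for i in range(dim)]


def pick_device(branch: Branch, newest_version: int,
                devices: List[Device]) -> Device:
    """Greedy stand-in for the RL policy: the staler the branch, the more
    device speed matters; rarely selected devices get a fairness bonus."""
    staleness = newest_version - branch.version

    def score(d: Device) -> float:
        speed = 1.0 / d.avg_round_time
        fairness = 1.0 / (1 + d.times_selected)
        return staleness * speed + fairness

    return max(devices, key=score)


if __name__ == "__main__":
    branches = [Branch([1.0, 3.0], version=1), Branch([3.0, 5.0], version=3)]
    print(merge_master(branches))

    devices = [Device("fast", 1.0, 5), Device("slow", 10.0, 0)]
    newest = max(b.version for b in branches)
    for b in branches:
        print(b.version, "->", pick_device(b, newest, devices).name)
```

In this toy setup the stale branch (version 1) is routed to the fast device so it can catch up, while the newest branch goes to the rarely used slow device, which mirrors the load-balancing intent described above.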