The increasing number and scale of federated learning (FL) jobs necessitate resource-efficient scheduling and management of aggregation to make cloud-hosted aggregation economically viable. Existing FL research has focused on FL algorithm design and optimization, and less on the efficacy of aggregation. Existing FL platforms often employ aggregators that actively wait for model updates. This wastes computational resources on the cloud, especially in large-scale FL settings where parties are only intermittently available for training. In this paper, we propose a new FL aggregation paradigm, "just-in-time" (JIT) aggregation, which leverages unique properties of FL jobs, especially the periodicity of model updates, to defer aggregation for as long as possible and free compute resources for other FL jobs or other datacenter workloads. We describe a novel way to prioritize FL jobs for aggregation and demonstrate, using multiple datasets, models, and FL aggregation algorithms, that our techniques can reduce resource usage by more than 60% compared to the eager aggregation used in existing FL platforms. We also demonstrate that JIT aggregation adds negligible overhead and has negligible impact on the latency of the FL job.
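To make the contrast between eager and JIT aggregation concrete, the following is a minimal sketch and not the scheduler described in the paper. It assumes that the arrival times of model updates in a round are known in advance (a stand-in for the periodicity the abstract mentions) and that aggregating one update takes a fixed time `cost`. The names `Plan`, `eager_plan`, and `jit_plan` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """When an aggregator is deployed and when it finishes a round."""
    start: float
    finish: float

    @property
    def held(self) -> float:
        # Time during which the aggregator occupies compute resources.
        return self.finish - self.start


def eager_plan(round_start: float, arrivals: list[float], cost: float) -> Plan:
    """Aggregator is deployed at round start and waits for each update.

    Assumption: each update takes `cost` time units to fold into the model.
    """
    t = round_start
    for arrival in sorted(arrivals):
        # Wait (idle) until the update arrives, then aggregate it.
        t = max(t, arrival) + cost
    return Plan(start=round_start, finish=t)


def jit_plan(round_start: float, arrivals: list[float], cost: float) -> Plan:
    """Defer deployment as late as possible without delaying the finish.

    Assumption: arrival times are predictable, so the eager finish time
    can be computed ahead of the round. Starting at `finish - n * cost`
    lets the aggregator process all n updates back to back with no idling.
    """
    finish = eager_plan(round_start, arrivals, cost).finish
    start = max(round_start, finish - len(arrivals) * cost)
    return Plan(start=start, finish=finish)
```

Under these assumptions, for updates arriving at 100, 220, 300, and 310 seconds with 5 seconds of aggregation work each, the eager aggregator is held for 315 seconds while the deferred one is held for 20 seconds, and both finish at the same time. These figures illustrate the sketch only and say nothing about the savings measured in the paper.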