Recent years have witnessed the accumulation of large amounts of decentralized data on the (edge) devices of end-users, while aggregating such decentralized data for machine learning jobs remains difficult due to laws and regulations. Federated Learning (FL) has emerged as an effective approach to handling decentralized data: global machine learning models are trained collaboratively without sharing the sensitive raw data. During the training process, the servers in FL need to select (and schedule) devices. However, scheduling devices for multiple jobs with FL remains a critical and open problem. In this paper, we propose a novel multi-job FL framework that enables the parallel training of multiple jobs. The framework consists of a system model and two scheduling methods. In the system model, we propose a parallel training process for multiple jobs and construct a cost model based on the training time and the data fairness of the devices participating in the training of diverse jobs. We propose a reinforcement learning-based method and a Bayesian optimization-based method to schedule devices for multiple jobs while minimizing the cost. We conduct extensive experiments with multiple jobs and datasets. The experimental results show that our proposed approaches significantly outperform baseline approaches in terms of training time (up to 8.67 times faster) and accuracy (up to 44.6% higher).