As an innovative solution for privacy-preserving machine learning (ML), federated learning (FL) is attracting considerable attention from both academia and industry. Although the techniques proposed over the past few years have advanced the field, the evaluation results reported in these works are often incomplete and hard to compare, because the evaluation metrics are inconsistent and no common platform exists. In this paper, we propose a comprehensive evaluation framework for FL systems. Specifically, we first introduce the ACTPR model, which defines five metrics that are indispensable to FL evaluation: Accuracy, Communication, Time efficiency, Privacy, and Robustness. We then design and implement a benchmarking system called FedEval, which enables the systematic evaluation and comparison of existing works under consistent experimental conditions. Using FedEval, we provide an in-depth benchmarking study of the two most widely used FL mechanisms, FedSGD and FedAvg. The benchmarking results show that each has advantages and disadvantages under the ACTPR model. For example, FedSGD is barely affected by non-independent and identically distributed (non-IID) data, whereas the accuracy of FedAvg declines by up to 9% in our experiments. On the other hand, FedAvg is more efficient than FedSGD in terms of time consumption and communication. Lastly, we distill a set of takeaways that we expect to be useful to researchers in the FL area.
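The abstract contrasts FedSGD and FedAvg without describing them. As a rough illustration only, the sketch below shows the usual textbook distinction: in FedSGD each client sends a single gradient per round, while in FedAvg each client takes several local steps and sends its updated weights. The one-parameter model, the toy data, and all function names here are assumptions made for illustration; they are not taken from FedEval.

```python
from typing import List, Tuple

# Hypothetical toy setup: each client holds (x, y) pairs for the 1-D model y = w * x.
Client = List[Tuple[float, float]]


def local_gradient(w: float, data: Client) -> float:
    """Gradient of the mean squared error on one client's data."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)


def fedsgd_round(w: float, clients: List[Client], lr: float) -> float:
    """Clients send one gradient each; the server averages them and takes one step."""
    total = sum(len(c) for c in clients)
    grad = sum(len(c) * local_gradient(w, c) for c in clients) / total
    return w - lr * grad


def fedavg_round(w: float, clients: List[Client], lr: float, local_steps: int) -> float:
    """Clients run several local steps; the server averages the resulting weights."""
    total = sum(len(c) for c in clients)
    new_w = 0.0
    for c in clients:
        w_local = w
        for _ in range(local_steps):
            w_local -= lr * local_gradient(w_local, c)
        new_w += len(c) * w_local / total  # weight each client by its sample count
    return new_w


if __name__ == "__main__":
    # Toy data generated from y = 3 * x, split unevenly across two clients.
    clients = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)]]
    w_sgd = w_avg = 0.0
    for _ in range(20):
        w_sgd = fedsgd_round(w_sgd, clients, lr=0.05)
        w_avg = fedavg_round(w_avg, clients, lr=0.05, local_steps=5)
    print(f"FedSGD: {w_sgd:.4f}  FedAvg: {w_avg:.4f}")
```

In this sketch, FedAvg with `local_steps=1` reduces to the same update as FedSGD. Larger values do more local computation per communication round, which is consistent with the communication and time advantage the abstract reports for FedAvg, though the sketch does not reproduce the paper's measurements.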