Federated learning enables edge devices to train a global model collaboratively without exposing their data. Despite its outstanding advantages in computing efficiency and privacy protection, federated learning faces a significant challenge when dealing with non-IID data, i.e., data generated by clients that are typically not independent and identically distributed. In this paper, we tackle a new type of non-IID data, called cluster-skewed non-IID, discovered in real-world datasets: a phenomenon in which clients can be grouped into clusters with similar data distributions. By performing an in-depth analysis of the behavior of a classification model's penultimate layer, we introduce a metric that quantifies the similarity between two clients' data distributions without violating their privacy. We then propose an aggregation scheme that guarantees equality between clusters. In addition, we introduce a novel local-training regularization based on the knowledge-distillation technique that reduces overfitting at the clients and dramatically boosts the performance of the training scheme. We theoretically prove the superiority of the proposed aggregation over the benchmark FedAvg. Extensive experimental results on both standard public datasets and our in-house real-world dataset demonstrate that the proposed approach improves accuracy by up to 16% compared with the FedAvg algorithm.
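The two server-side ideas can be illustrated with a minimal sketch. Assuming, purely for illustration, that each client summarizes its penultimate-layer activations by their mean and that summaries are compared via cosine similarity (the paper's actual metric comes from a deeper analysis of that layer and is not specified here), the aggregation then averages models within each cluster before averaging uniformly across clusters:

```python
import numpy as np

def penultimate_summary(features: np.ndarray) -> np.ndarray:
    """Summarize a client's penultimate-layer activations by their mean.

    `features` has shape (num_samples, feature_dim). In this sketch, only
    this aggregate statistic is shared, never the raw data, which is what
    keeps the comparison privacy-preserving.
    """
    return features.mean(axis=0)

def client_similarity(summary_a: np.ndarray, summary_b: np.ndarray) -> float:
    """Cosine similarity between two clients' feature summaries."""
    num = float(summary_a @ summary_b)
    den = float(np.linalg.norm(summary_a) * np.linalg.norm(summary_b)) + 1e-12
    return num / den

def cluster_equal_aggregate(client_weights, cluster_ids):
    """Aggregate client models so that every cluster contributes equally.

    Clients are first averaged within their assigned cluster, then the
    cluster means are averaged uniformly.
    """
    clusters = {}
    for w, c in zip(client_weights, cluster_ids):
        clusters.setdefault(c, []).append(w)
    cluster_means = [np.mean(ws, axis=0) for ws in clusters.values()]
    return np.mean(cluster_means, axis=0)
```

The contrast with FedAvg is the point of the second function: FedAvg weights clients by sample count, so a large cluster of clients with near-identical distributions dominates the global model, whereas uniform averaging over cluster means removes that bias.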
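The local-training regularizer is based on knowledge distillation from the global model. Below is a generic sketch of such a distillation-regularized local loss, assuming PyTorch and treating the received global model as a frozen teacher; the temperature and weighting hyperparameters are illustrative and not taken from the paper:

```python
import torch
import torch.nn.functional as F

def distillation_regularized_loss(student_logits: torch.Tensor,
                                  teacher_logits: torch.Tensor,
                                  labels: torch.Tensor,
                                  temperature: float = 2.0,
                                  alpha: float = 0.5) -> torch.Tensor:
    """Cross-entropy on local labels plus a KD term that pulls the local
    model toward the frozen global model's softened predictions, which
    discourages overfitting to the client's skewed local distribution.
    """
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2  # standard T^2 scaling to keep gradient magnitudes comparable
    return (1 - alpha) * ce + alpha * kd
```

During a local round, `teacher_logits` would come from a forward pass of the unmodified global model (with gradients disabled), so the regularizer adds no extra communication.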