Federated Learning (FL) is a distributed machine learning paradigm in which clients collaboratively train a model using their local (human-generated) datasets while preserving privacy. While existing studies focus on FL algorithm development to tackle data heterogeneity across clients, the important issue of data quality (e.g., label noise) in FL has been largely overlooked. This paper aims to fill this gap by providing a quantitative study of the impact of label noise on FL. Theoretically, we derive an upper bound on the generalization error that is linear in the clients' label noise level. Empirically, we conduct experiments on the MNIST and CIFAR-10 datasets using various FL algorithms. We show that the accuracy of the global model decreases linearly as the noise level increases, which is consistent with our theoretical analysis. We further find that label noise slows down the convergence of FL training and that the global model tends to overfit when the noise level is high.
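As a concrete illustration of the experimental setting described above, the sketch below shows one common way to simulate client-side label noise: symmetric label flipping, where each client's label is replaced with probability equal to the noise level by a uniformly random different class. This is an illustrative assumption, not the paper's exact protocol; the function names and parameters are hypothetical.

```python
import numpy as np

def flip_labels(labels, noise_level, num_classes, rng):
    """Symmetric label noise: with probability `noise_level`, replace each
    label by a different class chosen uniformly at random (illustrative)."""
    labels = labels.copy()
    flip_mask = rng.random(len(labels)) < noise_level
    for i in np.where(flip_mask)[0]:
        candidates = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(candidates)
    return labels

# Hypothetical usage: corrupt each client's local labels before FL training,
# e.g. for a 10-class task such as MNIST or CIFAR-10.
rng = np.random.default_rng(0)
num_classes = 10
noise_level = 0.3  # fraction of corrupted labels per client (example value)
client_labels = [rng.integers(0, num_classes, size=100) for _ in range(5)]
noisy_client_labels = [
    flip_labels(y, noise_level, num_classes, rng) for y in client_labels
]
```

Sweeping `noise_level` over a grid and measuring global-model test accuracy at each setting is the kind of experiment that would reveal the linear accuracy degradation reported in the abstract.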