Federated Learning is by nature susceptible to low-quality, corrupted, or even malicious data that can severely degrade the quality of the learned model. Traditional data-valuation techniques cannot be applied because the data is never revealed. We present a novel technique for filtering and scoring data based on a practical influence approximation that can be implemented in a privacy-preserving manner. Each agent uses its own data to evaluate the influence of another agent's batch, and reports an obfuscated score to the center using differential privacy. Our technique allows for almost perfect ($>92\%$ recall) filtering of corrupted data in a variety of applications using real data. Importantly, accuracy does not degrade significantly even under very strong privacy guarantees ($\varepsilon \leq 1$), especially for realistic proportions of mislabeled data (with $15\%$ mislabeled data we lose only $10\%$ in accuracy).
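The abstract does not fix a concrete instantiation, so purely as a hedged illustration, the sketch below shows one way the peer-scoring step could look under simple assumptions: the influence of a peer's batch is approximated as the one-step change in loss on the evaluator's own data, and the score is clipped and perturbed with Laplace noise before being reported. All names (`influence_score`, `dp_report`, the logistic model, the clipping bound) are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, X, y):
    """Binary cross-entropy of a logistic model w on (X, y)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def grad(w, X, y):
    """Gradient of the cross-entropy loss with respect to w."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def influence_score(w, peer_X, peer_y, own_X, own_y, lr=0.1):
    """Practical influence approximation (illustrative): how much a
    one-step gradient update on the peer's batch changes the loss on
    the evaluator's own data. Negative = the batch helps; positive =
    it hurts, suggesting corrupted or mislabeled data."""
    w_after = w - lr * grad(w, peer_X, peer_y)
    return loss(w_after, own_X, own_y) - loss(w, own_X, own_y)

def dp_report(score, sensitivity, eps):
    """Obfuscate the score with the Laplace mechanism before reporting
    to the center. Clipping to [-sensitivity, sensitivity] bounds the
    L1 sensitivity at 2 * sensitivity (an assumption of this sketch);
    eps is the differential-privacy budget."""
    clipped = np.clip(score, -sensitivity, sensitivity)
    return clipped + rng.laplace(scale=2 * sensitivity / eps)

# Toy demo on hypothetical synthetic data.
d = 5
w = np.zeros(d)
own_X, own_y = rng.normal(size=(50, d)), rng.integers(0, 2, 50)
peer_X, peer_y = rng.normal(size=(20, d)), rng.integers(0, 2, 20)
s = influence_score(w, peer_X, peer_y, own_X, own_y)
print(dp_report(s, sensitivity=1.0, eps=1.0))  # eps <= 1, as in the abstract
```

In such a scheme the center would aggregate the noisy scores each batch receives from many evaluators and filter out batches whose average score exceeds a threshold; averaging also dampens the per-report Laplace noise.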