Recently, federated learning has emerged as a promising approach for training a global model on data from multiple organizations without exposing their raw data. Nevertheless, directly applying federated learning to real-world tasks faces two challenges: (1) heterogeneity of the data across organizations; and (2) data noise within individual organizations. In this paper, we propose a general framework to address both challenges simultaneously. Specifically, we adopt a distributionally robust optimization paradigm, which samples clients from a learnable distribution at each iteration, to mitigate the negative effects of data heterogeneity. We further observe that this optimization paradigm is easily affected by data noise on local clients, which causes significant degradation in the global model's prediction accuracy. To address this problem, we incorporate mixup techniques into the local training process of federated learning. We also provide a comprehensive theoretical analysis covering robustness, convergence, and generalization ability. Finally, we conduct empirical studies across different drug discovery tasks, such as ADMET property prediction and drug-target affinity prediction.
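To make the client-sampling idea concrete, the following is a minimal sketch of one common way to maintain a learnable sampling distribution over clients under a distributionally robust objective: a multiplicative-weights (mirror-ascent) step that upweights clients with higher local loss. The function name, step size eta, and the exact update rule are illustrative assumptions, not the paper's stated algorithm.

```python
import numpy as np

def update_client_distribution(p, client_losses, sampled_ids, eta=0.1):
    """One multiplicative-weights step on the client sampling distribution.

    Clients that reported higher local loss receive higher sampling
    probability at the next communication round, which approximates
    optimizing the worst-case (distributionally robust) mixture of clients.
    """
    p = p.copy()
    # Upweight the sampled clients in proportion to exp(eta * loss).
    p[sampled_ids] *= np.exp(eta * np.asarray(client_losses))
    # Renormalize so p stays on the probability simplex.
    return p / p.sum()

# Per round (hypothetical driver loop): sample clients i ~ p, run local
# training on each, aggregate their updates into the global model, then
# call update_client_distribution with the clients' reported losses.
```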
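For the noise-robust local training step, the abstract refers to mixup, which trains on convex combinations of input pairs and their labels. Below is a minimal sketch of standard mixup as it could be applied inside a client's local update; alpha and the surrounding training-loop names are assumptions for illustration.

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=0.2):
    """Mix each example with a randomly permuted partner from the same batch.

    Returns the mixed inputs, both label sets, and the mixing coefficient,
    so the training loss can be interpolated accordingly.
    """
    lam = np.random.beta(alpha, alpha)           # mixing coefficient lam ~ Beta(alpha, alpha)
    perm = torch.randperm(x.size(0))             # random pairing within the batch
    x_mixed = lam * x + (1.0 - lam) * x[perm]    # convex combination of inputs
    return x_mixed, y, y[perm], lam

# Inside a client's local training step (hypothetical names):
# x_mixed, y_a, y_b, lam = mixup_batch(x, y)
# loss = lam * criterion(model(x_mixed), y_a) + (1 - lam) * criterion(model(x_mixed), y_b)
```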