Federated learning is a paradigm that enables collaborative learning across different parties without revealing raw data. Notably, vertical federated learning (VFL), where parties share the same set of samples but each holds only partial features, has a wide range of real-world applications. However, most existing studies in VFL disregard the "record linkage" process. They design algorithms either assuming the data from different parties can be exactly linked or simply linking each record with its most similar neighboring record. These approaches may fail to capture key features from other, less similar records. Moreover, such improper linkage cannot be corrected during training, since existing approaches provide no feedback on linkage. In this paper, we design a novel coupled training paradigm, FedSim, that integrates one-to-many linkage into the training process. Besides enabling VFL in many real-world applications with fuzzy identifiers, FedSim also achieves better performance in traditional VFL tasks. Moreover, we theoretically analyze the additional privacy risk incurred by sharing similarities. Our experiments on eight datasets with various similarity metrics show that FedSim outperforms other state-of-the-art baselines. The code of FedSim is available at https://github.com/Xtra-Computing/FedSim.
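To make the contrast with one-to-one linkage concrete, the following is a minimal sketch (not FedSim's actual algorithm) of similarity-based one-to-many linkage: each record held by one party is linked to its K most similar records at the other party, and the similarity scores are retained so a downstream model can weight all K candidates rather than a single hard-linked neighbor. The names (`ids_a`, `ids_b`, `K`) and the negative-Euclidean similarity are illustrative assumptions.

```python
import numpy as np

def one_to_many_linkage(ids_a, ids_b, K=5):
    """For each fuzzy identifier in ids_a, return the indices and similarities
    of its K most similar identifiers in ids_b. Negative Euclidean distance is
    used here as a stand-in similarity metric."""
    # pairwise squared distances between identifier vectors
    d2 = ((ids_a[:, None, :] - ids_b[None, :, :]) ** 2).sum(-1)
    sim = -np.sqrt(d2)                      # higher = more similar
    topk = np.argsort(-sim, axis=1)[:, :K]  # indices of the K best matches
    topk_sim = np.take_along_axis(sim, topk, axis=1)
    return topk, topk_sim

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ids_a = rng.normal(size=(100, 4))                      # identifiers at party A
    ids_b = ids_a + rng.normal(scale=0.1, size=(100, 4))   # noisy copies at party B
    idx, sims = one_to_many_linkage(ids_a, ids_b, K=5)
    print(idx.shape, sims.shape)            # (100, 5) (100, 5)
```

In FedSim, the retained similarities are further fed into the training process, which is what allows imperfect linkage to be compensated for during learning rather than fixed beforehand.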