Federated learning is a paradigm that enables collaborative learning across parties without revealing raw data. Notably, vertical federated learning (VFL), in which parties share the same set of samples but each holds only a subset of the features, has a wide range of real-world applications. However, most existing VFL studies disregard the "record linkage" process. They design algorithms that either assume the data from different parties can be linked exactly or simply link each record to its most similar neighboring record. These approaches may fail to capture key features from other, less similar records. Moreover, such improper linkage cannot be corrected by training, since existing approaches provide no feedback on linkage during training. In this paper, we design a novel coupled training paradigm, FedSim, that integrates one-to-many linkage into the training process. Besides enabling VFL in many real-world applications with fuzzy identifiers, FedSim also achieves better performance in traditional VFL tasks. Moreover, we theoretically analyze the additional privacy risk incurred by sharing similarities. Our experiments on eight datasets with various similarity metrics show that FedSim outperforms other state-of-the-art baselines. The code for FedSim is available at https://github.com/Xtra-Computing/FedSim.
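To make the contrast between one-to-one and one-to-many linkage concrete, the following is a minimal sketch and not FedSim's actual implementation. It assumes that each party holds a numeric fuzzy identifier per record and that similarity is the negative Euclidean distance; the function name `top_k_links` and the sample identifiers are hypothetical. How the resulting similarities are used during training is not described in this abstract and is not modeled here.

```python
import math


def top_k_links(keys_a, keys_b, k):
    """For each record of party A, return its k most similar records in party B.

    Similarity is the negative Euclidean distance between identifiers
    (an illustrative assumption; any similarity metric could be swapped in).
    Returns, per record of A, a list of (index_in_b, similarity) pairs
    ordered from most to least similar.
    """
    links = []
    for key_a in keys_a:
        scored = [(j, -math.dist(key_a, key_b)) for j, key_b in enumerate(keys_b)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        links.append(scored[:k])
    return links


# Hypothetical fuzzy identifiers (e.g. noisy coordinates) held by two parties.
keys_a = [(0.0, 0.0), (5.0, 5.0)]
keys_b = [(0.1, 0.0), (0.0, 0.3), (5.2, 5.0), (9.0, 9.0)]

# k=1 links each record only to its most similar neighbor.
one_to_one = top_k_links(keys_a, keys_b, k=1)

# k=2 keeps a less similar candidate as well, together with its similarity.
one_to_many = top_k_links(keys_a, keys_b, k=2)
```

With `k=1`, the first record of party A is linked only to record 0 of party B. With `k=2`, record 1 of party B is retained as well, so any features it carries remain available to a downstream model along with a similarity score that indicates how much the link can be trusted.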