Federated learning is a learning paradigm that enables collaborative learning across different parties without revealing raw data. Notably, \textit{vertical federated learning} (VFL), where parties share the same set of samples but each holds only a subset of the features, has a wide range of real-world applications. However, most existing studies in VFL disregard the "record linkage" process. They design algorithms either assuming that the data from different parties can be exactly linked or simply linking each record with its most similar neighboring record. These approaches may fail to capture key features from other, less similar records. Moreover, such improper linkage cannot be corrected by training, since existing approaches provide no feedback on linkage during training. In this paper, we design a novel coupled training paradigm, FedSim, that integrates one-to-many linkage into the training process. Besides enabling VFL in many real-world applications with fuzzy identifiers, FedSim also achieves better performance in traditional VFL tasks. Moreover, we theoretically analyze the additional privacy risk incurred by sharing similarities. Our experiments on eight datasets with various similarity metrics show that FedSim outperforms other state-of-the-art baselines. The code of FedSim is available at \url{https://github.com/Xtra-Computing/FedSim}.
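To make the contrast between linking each record to its single most similar neighbor and one-to-many linkage concrete, the following is a minimal sketch and not FedSim's actual implementation. The function name `top_k_links`, the toy two-dimensional identifiers, and the use of negative Euclidean distance as the similarity are all assumptions made for illustration only.

```python
import math


def top_k_links(keys_a, keys_b, k):
    """For each record of party A, return the k most similar records of party B.

    Each link is a pair (index_in_b, similarity). Similarity is the negative
    Euclidean distance between fuzzy identifiers (an illustrative choice).
    """
    links = []
    for ka in keys_a:
        # Score every candidate record held by party B.
        scored = [(-math.dist(ka, kb), j) for j, kb in enumerate(keys_b)]
        scored.sort(reverse=True)  # highest similarity first
        links.append([(j, sim) for sim, j in scored[:k]])
    return links


# Hypothetical fuzzy identifiers (e.g. noisy coordinates) held by two parties.
keys_a = [(0.0, 0.0), (5.0, 5.0)]
keys_b = [(0.1, 0.0), (0.0, 0.3), (5.0, 4.8), (9.0, 9.0)]

top1 = top_k_links(keys_a, keys_b, k=1)  # one-to-one: nearest neighbor only
top2 = top_k_links(keys_a, keys_b, k=2)  # one-to-many: keep k candidates
```

With `k=1` only the nearest record survives, so any information carried by the other candidates is discarded before training. With `k>1`, the candidates and their similarities are both retained, which is the kind of input a model that learns from one-to-many linkage would need.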