As privacy in machine learning has drawn increasing attention, federated learning has been introduced to enable collaborative learning without revealing raw data. Notably, \textit{vertical federated learning} (VFL), in which parties share the same set of samples but each holds only a subset of the features, has a wide range of real-world applications. However, existing VFL studies rarely address the ``record linkage'' process. They either design algorithms under the assumption that data from different parties have already been linked, or they rely on simple linkage methods such as exact-linkage or top1-linkage. These approaches are unsuitable for many applications, such as those whose linkage keys are GPS locations or noisy titles and therefore require fuzzy matching. In this paper, we design a novel similarity-based VFL framework, FedSim, which is suitable for more real-world applications and achieves higher performance on traditional VFL tasks. Moreover, we theoretically analyze the privacy risk caused by sharing similarities. Our experiments on three synthetic datasets and five real-world datasets with various similarity metrics show that FedSim consistently outperforms other state-of-the-art baselines.
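To make the linkage terminology concrete, the following minimal sketch contrasts exact, top-1, and top-$k$ linkage on noisy two-dimensional keys such as GPS coordinates. It is an illustration only and is not the FedSim algorithm: the function name `top_k_linkage`, the Euclidean metric, and the toy coordinates are all assumptions made for this example, and the keys are compared in plain text, so none of the privacy considerations discussed in the paper are addressed here.

```python
import numpy as np


def top_k_linkage(keys_a, keys_b, k=1):
    """For each key in keys_a, return the indices and distances of its
    k nearest keys in keys_b under Euclidean distance (hypothetical helper)."""
    keys_a = np.asarray(keys_a, dtype=float)
    keys_b = np.asarray(keys_b, dtype=float)
    # Pairwise distances, shape (len(keys_a), len(keys_b)).
    dists = np.linalg.norm(keys_a[:, None, :] - keys_b[None, :, :], axis=2)
    # Column indices of the k smallest distances per row, nearest first.
    idx = np.argsort(dists, axis=1)[:, :k]
    return idx, np.take_along_axis(dists, idx, axis=1)


# Toy keys (assumed values): no key in party_a equals a key in party_b,
# so exact linkage would link nothing.
party_a = np.array([[0.0, 0.0], [1.0, 1.0]])
party_b = np.array([[0.1, 0.0], [0.9, 1.0], [5.0, 5.0]])

# k=1 corresponds to top-1 linkage; k=2 keeps two candidates per record
# together with their distances, which a downstream model could weigh.
idx, dist = top_k_linkage(party_a, party_b, k=2)
```

With `k=1` each record keeps only its single nearest neighbour and the distance information is discarded after matching; keeping several candidates along with their distances is the kind of signal a similarity-based approach can exploit.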