In many applications, multiple parties hold private data about the same set of users but over disjoint sets of attributes, and a server wants to leverage that data to train a model. To enable model learning while protecting the privacy of the data subjects, we need vertical federated learning (VFL) techniques, in which the data parties share only the information needed to train the model rather than the private data itself. However, it is challenging to ensure that the shared information preserves privacy while still yielding accurate models. To the best of our knowledge, the algorithm proposed in this paper is the first practical solution for differentially private vertical federated k-means clustering, in which the server obtains a set of global centers with a provable differential privacy guarantee. Our algorithm assumes an untrusted central server that aggregates differentially private local centers and membership encodings from the local data parties. From the received information, the server builds a weighted grid as a synopsis of the global dataset. The final centers are generated by running any k-means algorithm on the weighted grid. Our approach to grid weight estimation uses a novel, lightweight, and differentially private set intersection cardinality estimation algorithm based on the Flajolet-Martin sketch. To improve the estimation accuracy in settings with more than two data parties, we further propose a refined version of the weight estimation algorithm and a parameter tuning strategy that bring the final k-means utility close to that in the central private setting. We provide a theoretical utility analysis and an experimental evaluation of the cluster centers computed by our algorithm, and show that our approach performs better, both theoretically and empirically, than two baselines based on existing techniques.
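To make the sketch-based weight estimation more concrete, the following is a minimal, non-private illustration of how Flajolet-Martin sketches of user-ID sets can be merged and used to estimate an intersection cardinality. It is a sketch under stated assumptions, not the algorithm of this paper: it adds no differential privacy noise, it uses the classical stochastic-averaging (PCSA) variant of the sketch, and it obtains the intersection by inclusion-exclusion, which is an assumption made here for illustration. All function names (`fm_sketch`, `merge`, `estimate`, `intersection_estimate`) and the bucket count are hypothetical.

```python
import hashlib

PHI = 0.77351  # correction constant from the Flajolet-Martin analysis


def _hash64(item):
    """Deterministic 64-bit hash of a string (truncated SHA-256)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fm_sketch(items, num_buckets=64):
    """Build a PCSA-style Flajolet-Martin sketch: one bitmap per bucket."""
    bitmaps = [0] * num_buckets
    for item in items:
        h = _hash64(item)
        bucket = h % num_buckets   # which bitmap this item updates
        rest = h // num_buckets    # remaining bits choose the bit position
        # Index of the lowest set bit of `rest`; index i occurs with
        # probability about 2^-(i+1). The all-zero case is capped at 63.
        rho = (rest & -rest).bit_length() - 1 if rest else 63
        bitmaps[bucket] |= 1 << rho
    return bitmaps


def merge(sketch_a, sketch_b):
    """Sketch of the union of two sets: bitwise OR of the bitmaps."""
    return [x | y for x, y in zip(sketch_a, sketch_b)]


def estimate(bitmaps):
    """Estimate the number of distinct items summarized by a sketch."""
    total_r = 0
    for bitmap in bitmaps:
        r = 0
        while bitmap & (1 << r):   # r = index of the lowest zero bit
            r += 1
        total_r += r
    m = len(bitmaps)
    return m / PHI * 2 ** (total_r / m)


def intersection_estimate(sketch_a, sketch_b):
    """Inclusion-exclusion (illustrative assumption): |A| + |B| - |A u B|."""
    union = estimate(merge(sketch_a, sketch_b))
    return estimate(sketch_a) + estimate(sketch_b) - union
```

In this illustration, each party would sketch the user IDs that fall in one of its local clusters, and a server holding only the sketches could call `intersection_estimate` to approximate how many users two local clusters share. The estimate is noisy, since inclusion-exclusion combines the errors of three cardinality estimates, and the estimator is biased upward for very small sets.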