Consider two data providers that want to contribute data to a shared learning model. Recent works have shown that the value of one provider's data depends on its similarity to the data owned by the other provider. It would thus be beneficial if the two providers could calculate the similarity of their data while keeping the actual data private. In this work, we devise multiparty computation protocols that compute the similarity of two data sets based on correlation, while offering controllable privacy guarantees. We consider a simple model with two participating providers and develop two protocols, one computing the exact correlation and one computing an approximate correlation, each with controlled information leakage. Both protocols have computational and communication complexities that are linear in the number of data samples. We also provide general bounds on the maximal error in the approximate case, and analyse the resulting errors for practical parameter choices.
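To illustrate why a correlation-based similarity lends itself to a two-party protocol with linear cost, the following minimal sketch splits the Pearson correlation into quantities each provider can compute alone and a single cross term. It is not the protocol devised in this work: the function names `local_stats` and `pearson_from_parts` are hypothetical, the choice of Pearson correlation is an assumption, and the cross term is computed in the clear here as a stand-in for whatever private computation a real protocol would use.

```python
import math


def local_stats(values):
    """Quantities one provider can compute without the other: n, sum, sum of squares."""
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    return n, s, ss


def pearson_from_parts(n, sx, sxx, sy, syy, sxy):
    """Pearson correlation from per-provider sums and the single cross term sxy."""
    cov = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    return cov / math.sqrt(var_x * var_y)


# Toy data held by the two providers (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n, sx, sxx = local_stats(x)   # computed by provider A alone
_, sy, syy = local_stats(y)   # computed by provider B alone

# The only joint quantity. It is computed in the clear here; this is a
# placeholder (assumption) for a private inner-product computation.
sxy = sum(a * b for a, b in zip(x, y))

r = pearson_from_parts(n, sx, sxx, sy, syy, sxy)
print(r)
```

Every step is a single pass over the samples, which is consistent with the linear complexity stated above. Only the cross term requires both data sets, so it is the natural place for the privacy mechanism to act.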