In collaborative learning, multiple parties contribute their datasets to jointly train global machine learning models for a variety of predictive tasks. Despite its benefits, this paradigm excludes critical application domains that involve highly sensitive data, such as healthcare and security analytics, where privacy concerns force each entity to train models individually on its own dataset. In this work, we target privacy-preserving collaborative hierarchical clustering. We introduce a formal security definition that balances utility and privacy, and we present a two-party protocol that provably satisfies it. We then extend our protocol with: (i) an optimized version for single-linkage clustering, and (ii) scalable approximation variants. We implement all our schemes and experimentally evaluate their performance and accuracy on synthetic and real datasets, obtaining very encouraging results. For example, end-to-end execution of our secure approximate protocol on more than 1M 10-dimensional data samples requires 35 seconds of computation and achieves 97.09% accuracy.