A successful machine learning (ML) algorithm often relies on a large amount of high-quality data to train well-performing models. Supervised learning approaches, such as deep learning techniques, deliver high-quality ML models for real-world applications, but at a large cost in human effort to label training data. Recent advances in federated learning (FL) allow multiple data owners or organisations to collaboratively train a machine learning model without sharing raw data. In this light, vertical FL allows organisations to build a global model when the participating organisations hold vertically partitioned data. Further, in the vertical FL setting a participating organisation generally requires fewer resources than sharing data directly, enabling lightweight and scalable distributed training solutions. However, privacy protection in vertical FL is challenging because the intermediate outputs and the gradients of the model update must be communicated, which invites adversarial entities to infer other organisations' underlying data. In this paper, we therefore explore how to protect the privacy of each organisation's data in a differential privacy (DP) setting. We run experiments with different real-world datasets and DP budgets. Our experimental results show that a trade-off point needs to be found in the amount of perturbation noise to balance vertical FL performance against privacy protection.
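To make the perturbation step concrete, the sketch below shows one way a participating organisation could clip and Laplace-perturb its intermediate embedding before communicating it to the other parties. This is a minimal illustration under our own assumptions, not the protocol evaluated in this paper; the Party class, the L1 clipping rule, and all parameter names are hypothetical.

```python
import numpy as np

def perturb(embedding, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise calibrated to the per-round DP budget epsilon.

    A smaller epsilon (stricter privacy) yields a larger noise scale,
    which is the utility/privacy trade-off discussed in the abstract.
    """
    if rng is None:
        rng = np.random.default_rng()
    scale = sensitivity / epsilon
    return embedding + rng.laplace(0.0, scale, size=embedding.shape)

class Party:
    """One organisation holding a vertical slice of the features."""

    def __init__(self, n_features, out_dim, epsilon, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(n_features, out_dim))
        self.epsilon = epsilon

    def forward(self, x):
        z = x @ self.W
        # Clip each sample's L1 norm to 1 so the sensitivity assumed by
        # the Laplace mechanism is bounded before perturbation.
        z = z / np.maximum(1.0, np.abs(z).sum(axis=1, keepdims=True))
        return perturb(z, self.epsilon)

# Toy vertical split: two parties hold different feature columns of the
# same four samples; the active party concatenates the noisy embeddings
# to continue the forward pass.
rng = np.random.default_rng(42)
x_a, x_b = rng.normal(size=(4, 5)), rng.normal(size=(4, 3))
party_a = Party(5, 2, epsilon=1.0)
party_b = Party(3, 2, epsilon=1.0, seed=1)
joint = np.concatenate([party_a.forward(x_a), party_b.forward(x_b)], axis=1)
print(joint.shape)  # (4, 4)
```

Under this per-sample clipping, shrinking the budget epsilon inflates the noise scale sensitivity/epsilon, which degrades the joint embedding and hence model utility, matching the trade-off our experiments quantify.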