Machine learning is increasingly used across diverse applications and domains, whether in healthcare to predict pathologies or in the financial sector to detect fraud. One of the linchpins of efficiency and accuracy in machine learning is data utility. However, when data contains personal information, full access may be restricted by laws and regulations aiming to protect individuals' privacy. Data owners must therefore ensure that any shared data preserves such privacy. Removal and transformation of private information (de-identification) are among the most common techniques. Intuitively, one can anticipate that reducing detail or distorting information would result in a loss of model predictive performance. However, previous work on classification tasks using de-identified data generally demonstrates that predictive performance can be preserved in specific applications. In this paper, we aim to evaluate the existence of a trade-off between data privacy and predictive performance in classification tasks. We leverage a large set of privacy-preserving techniques and learning algorithms to assess re-identification risk and the impact of the transformed variants on predictive performance. In contrast to previous literature, we find that the higher the level of privacy (i.e., the lower the re-identification risk), the greater the impact on predictive performance, pointing towards clear evidence of a trade-off.
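As a minimal, illustrative sketch of the kind of de-identification the abstract refers to (not the paper's actual pipeline), the snippet below suppresses a direct identifier, generalizes two quasi-identifiers, and compares a classifier's accuracy on the original and de-identified variants. The column names, toy data, and choice of RandomForestClassifier are assumptions made purely for the example.

```python
# Hedged sketch: suppression + generalization as de-identification, then a
# comparison of predictive performance on original vs. de-identified data.
# All column names and the toy dataset are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy dataset: 'name' is a direct identifier; 'age' and 'zip' are quasi-identifiers.
df = pd.DataFrame({
    "name":   [f"person_{i}" for i in range(200)],
    "age":    [20 + (i % 50) for i in range(200)],
    "zip":    [10000 + (i % 40) for i in range(200)],
    "target": [(i % 50 + i % 40) % 2 for i in range(200)],
})

def deidentify(data: pd.DataFrame) -> pd.DataFrame:
    out = data.drop(columns=["name"])     # suppression: remove the direct identifier
    out["age"] = (out["age"] // 10) * 10  # generalization: age -> 10-year bands
    out["zip"] = out["zip"] // 100        # generalization: truncate the ZIP code
    return out

for label, variant in [("original", df.drop(columns=["name"])),
                       ("de-identified", deidentify(df))]:
    X, y = variant.drop(columns=["target"]), variant["target"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print(label, accuracy_score(y_te, model.predict(X_te)))
```

Coarser generalization (wider age bands, shorter ZIP prefixes) lowers re-identification risk but also removes signal the classifier could use, which is exactly the trade-off the paper sets out to measure.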