The protection of private information is a crucial issue in data-driven research and business contexts. Techniques such as anonymisation or (selective) deletion are typically introduced to enable data sharing, e.g. in collaborative research endeavours. Among anonymisation techniques, the $k$-anonymity criterion is one of the most popular, with numerous scientific publications on different algorithms and metrics. Anonymisation techniques often require changing the data and thus necessarily affect the results of machine learning models trained on the underlying data. In this work, we conduct a systematic comparison and detailed investigation of the effects of different $k$-anonymisation algorithms on the results of machine learning models. We investigate a set of popular $k$-anonymisation algorithms with different classifiers and evaluate them on several real-world datasets. Our systematic evaluation shows that with an increasingly strong $k$-anonymity constraint, classification performance generally degrades, but to varying degrees and strongly depending on the dataset and anonymisation method. Furthermore, Mondrian can be considered the method with the most appealing properties for subsequent classification.
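To make the $k$-anonymity criterion discussed above concrete, the following minimal sketch (our own illustration, not code from the paper) checks whether a set of records satisfies $k$-anonymity with respect to a chosen set of quasi-identifier attributes, i.e. whether every combination of quasi-identifier values occurs in at least $k$ records. The records, attribute names, and generalised values are hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True iff every combination of quasi-identifier values
    appears in at least k records (the k-anonymity criterion)."""
    counts = Counter(
        tuple(record[attr] for attr in quasi_identifiers)
        for record in records
    )
    return all(count >= k for count in counts.values())

# Toy example: age generalised to ranges, ZIP code truncated,
# as a generalisation-based anonymisation algorithm might produce.
records = [
    {"age": "30-40", "zip": "123**", "diagnosis": "flu"},
    {"age": "30-40", "zip": "123**", "diagnosis": "cold"},
    {"age": "40-50", "zip": "456**", "diagnosis": "flu"},
    {"age": "40-50", "zip": "456**", "diagnosis": "asthma"},
]

print(is_k_anonymous(records, ["age", "zip"], k=2))  # True: each group has 2 rows
print(is_k_anonymous(records, ["age", "zip"], k=3))  # False: groups are too small
```

Stronger constraints (larger $k$) force coarser generalisation of the quasi-identifiers, which is precisely why classification performance on anonymised data tends to degrade as $k$ grows.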