Big data analysis poses the dual problem of privacy preservation and utility: how accurate do data analyses remain after the original data has been transformed to protect the privacy of the individuals it describes, and are they accurate enough to be meaningful? In this paper, we investigate across several datasets whether different methods of generating fully synthetic data vary in their utility a priori (when the specific analyses to be performed on the data are not yet known), how closely their results conform to analyses on the original data a posteriori, and whether these two effects are correlated. We find that some methods (decision-tree-based ones) perform better than others across the board, that some choices of imputation parameters (notably the number of released datasets) have sizeable effects, that broad utility metrics show no correlation with analysis accuracy, and that narrow metrics show varying correlations. We also obtained promising results for classification tasks when using synthetic data to train machine learning models, which we consider worth exploring further, including with respect to mitigating privacy attacks against ML models such as membership inference and model inversion.