Data synthesis is a privacy-enhancing technology that aims to produce realistic and timely data when real data are hard to obtain. The utility of synthetic data generators (SDGs) has been investigated through a variety of utility metrics. These metrics have been found to yield conflicting conclusions, making direct comparison of SDGs surprisingly difficult. Moreover, prior research found no correlation between popular metrics and concluded that they capture different dimensions of utility. This paper aggregates four popular utility metrics (representing different utility dimensions) into a single measure using principal component analysis (PCA) and checks whether the new measure can identify synthetic data that perform well in real-life use. The new measure is then used to compare four well-recognized SDGs.
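To make the aggregation idea concrete, the following is a minimal sketch of how several utility metrics might be collapsed into one score with PCA. It is not the paper's procedure, which the abstract does not specify. It assumes that each row holds one synthetic dataset, each column holds one utility metric, every metric is oriented so that higher means better, and the first principal component is taken as the composite. The function name `pca_composite_score` and the numbers in the demo are hypothetical.

```python
import numpy as np


def pca_composite_score(scores: np.ndarray) -> np.ndarray:
    """Project standardized metric scores onto their first principal component.

    scores: array of shape (n_datasets, n_metrics); higher is assumed better
    for every column (an assumption of this sketch, not stated in the abstract).
    Returns one composite score per dataset.
    """
    scores = np.asarray(scores, dtype=float)
    std = scores.std(axis=0)
    if np.any(std == 0):
        raise ValueError("every metric must vary across datasets")

    # Standardize so that no metric dominates because of its scale.
    z = (scores - scores.mean(axis=0)) / std

    # The rows of vt are the principal axes of the standardized data.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    loadings = vt[0]

    # The sign of a principal axis is arbitrary; orient it so that a higher
    # composite score goes with higher metric values overall.
    if loadings.sum() < 0:
        loadings = -loadings

    return z @ loadings


if __name__ == "__main__":
    # Made-up values for illustration only: 5 synthetic datasets, 4 metrics.
    demo = np.array([
        [0.61, 0.72, 0.55, 0.80],
        [0.58, 0.70, 0.60, 0.77],
        [0.75, 0.81, 0.66, 0.85],
        [0.40, 0.52, 0.43, 0.60],
        [0.69, 0.78, 0.58, 0.83],
    ])
    print(pca_composite_score(demo))
```

Using only the first component discards whatever variance the remaining components carry. When the metrics are weakly correlated, as the abstract reports, that share can be substantial, so a weighted combination of several components may be a more faithful summary.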