Sharing data can enable compelling applications and analytics. However, valuable datasets often contain sensitive information, so sharing them can endanger the privacy of users and organizations. An alternative gaining momentum in the research community is to share synthetic data instead. The idea is to release artificially generated datasets that resemble the actual data; more precisely, datasets that have similar statistical properties. So how do you generate synthetic data? What is it useful for? What are the benefits and the risks? Which research questions remain open? In this article, we provide a gentle introduction to synthetic data and discuss its use cases, the privacy challenges that are still unaddressed, and its inherent limitations as an effective privacy-enhancing technology.