Data play a crucial role in machine learning. In real-world applications, however, data often suffer from several problems: quality may be low; a limited number of data points can lead to under-fitting of the machine learning model; and access is often restricted by privacy, safety, and regulatory concerns. \textit{Synthetic data generation} offers a promising alternative, because synthetic data can be shared and used in ways that real-world data cannot. This paper systematically reviews existing work that leverages machine learning models for synthetic data generation. Specifically, we discuss this work from several perspectives: (i) applications, including computer vision, speech, natural language, healthcare, and business; (ii) machine learning methods, particularly neural network architectures and deep generative models; and (iii) privacy and fairness issues. In addition, we identify the challenges and opportunities in this emerging field and suggest future research directions.