Generating synthetic data with generative models is attracting growing interest in the ML community and beyond. In the past, synthetic data was often regarded as a means of private data release, but a surge of recent papers explores how its potential reaches much further -- from creating fairer data to data augmentation, and from simulation to text generated by ChatGPT. In this perspective, we explore whether, and how, synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs. Just as importantly, we discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data -- the most important of which is quantifying how much we can trust any finding or prediction drawn from synthetic data.