Tabular data synthesis is a long-standing research topic in machine learning. Many methods have been proposed over the past decades, ranging from statistical methods to deep generative methods. However, these methods have not always been successful, owing to the complicated nature of real-world tabular data. In this paper, we present a new model named Score-based Tabular data Synthesis (STaSy) and its training strategy, based on the paradigm of score-based generative modeling. Although score-based generative models have resolved many issues in generative modeling, there is still room for improvement in tabular data synthesis. Our proposed training strategy includes a self-paced learning technique and a fine-tuning strategy, which further increase sampling quality and diversity by stabilizing denoising score matching training. Furthermore, we conduct rigorous experimental studies in terms of the generative task trilemma: sampling quality, diversity, and time. In our experiments with 15 benchmark tabular datasets and 7 baselines, our method outperforms existing methods in terms of task-dependent evaluations and diversity.