Denoising diffusion probabilistic models are becoming the leading paradigm of generative modeling for many important data modalities. Although most prevalent in the computer vision community, diffusion models have recently gained attention in other domains as well, including speech, NLP, and graph-like data. In this work, we investigate whether the framework of diffusion models can be advantageous for general tabular problems, where datapoints are typically represented by vectors of heterogeneous features. The inherent heterogeneity of tabular data makes accurate modeling quite challenging, since individual features can be of completely different nature: some can be continuous and others discrete. To address such data types, we introduce TabDDPM, a diffusion model that can be applied to any tabular dataset and handles any type of feature. We extensively evaluate TabDDPM on a wide set of benchmarks and demonstrate its superiority over existing GAN/VAE alternatives, which is consistent with the advantage of diffusion models in other fields. Additionally, we show that TabDDPM is suitable for privacy-oriented setups, where the original datapoints cannot be publicly shared.
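To make the idea of diffusing heterogeneous features concrete, the following is a minimal sketch of a forward (noising) step on a mixed-type table. The abstract does not specify the noising processes, so the sketch assumes Gaussian noise for continuous columns and a uniform-category (multinomial) corruption for discrete columns. The function names, the linear noise schedule, and its default values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def noise_continuous(x0, alpha_bar_t, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


def noise_categorical(codes, num_classes, alpha_bar_t, rng):
    """Sample x_t ~ Cat(alpha_bar_t * onehot(x0) + (1 - alpha_bar_t) / K).

    `codes` is a 1-D integer array of category indices in [0, K).
    """
    onehot = np.eye(num_classes)[codes]
    probs = alpha_bar_t * onehot + (1.0 - alpha_bar_t) / num_classes
    # Inverse-CDF sampling, one uniform draw per row.
    cdf = probs.cumsum(axis=1)
    u = rng.random((codes.shape[0], 1))
    sampled = (u >= cdf).sum(axis=1)
    # Guard against floating-point round-off in the last CDF entry.
    return np.minimum(sampled, num_classes - 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = linear_alpha_bar(1000)

    # Hypothetical toy table: two continuous columns, one 3-way categorical.
    x_num = np.array([[0.5, -1.0], [2.0, 0.0], [-0.3, 1.2]])
    x_cat = np.array([0, 2, 1])

    t = 500
    print(noise_continuous(x_num, alpha_bar[t], rng))
    print(noise_categorical(x_cat, 3, alpha_bar[t], rng))
```

Each feature type is corrupted by its own process while sharing a single timestep, so a single denoising network could in principle be trained to reverse both at once. As the cumulative schedule value approaches zero, continuous columns approach standard normal noise and categorical columns approach a uniform distribution over their classes.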