Missing value imputation in machine learning is the task of accurately estimating the missing values in a dataset from the available information. Several deep generative modeling methods have been proposed for this task and shown to be useful, e.g., generative adversarial imputation networks. Recently, diffusion models have gained popularity because of their effectiveness in generative modeling of images, text, audio, and other data. To our knowledge, little attention has been paid to the effectiveness of diffusion models for missing value imputation in tabular data. Building on recent developments in diffusion models for time-series data imputation, we propose a diffusion model approach called "Conditional Score-based Diffusion Models for Tabular data" (CSDI_T). To handle categorical and numerical variables simultaneously, we investigate three techniques: one-hot encoding, analog bits encoding, and feature tokenization. Experimental results on benchmark datasets demonstrate the effectiveness of CSDI_T compared with well-known existing methods and highlight the importance of the categorical embedding techniques.
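As an illustration of two of the categorical encodings named above, the following is a minimal sketch of one-hot encoding and analog bits encoding for an integer-coded categorical column. It is not the CSDI_T implementation: the function names, the use of NumPy, and the toy column are assumptions made for illustration, and the analog bits variant shown is the common formulation that maps each binary digit to -1 or +1 and decodes by thresholding at zero. Feature tokenization relies on learned embeddings and is not sketched here.

```python
import numpy as np


def one_hot_encode(codes, num_categories):
    """Map integer category codes to one-hot rows of length num_categories."""
    codes = np.asarray(codes)
    out = np.zeros((codes.shape[0], num_categories), dtype=np.float32)
    out[np.arange(codes.shape[0]), codes] = 1.0
    return out


def analog_bits_encode(codes, num_categories):
    """Map integer category codes to binary digits rescaled to {-1, +1}."""
    codes = np.asarray(codes)
    # Number of bits needed to represent the largest code (at least one bit).
    num_bits = max(1, int(num_categories - 1).bit_length())
    # Most significant bit first.
    shifts = np.arange(num_bits - 1, -1, -1)
    bits = (codes[:, None] >> shifts) & 1
    return bits.astype(np.float32) * 2.0 - 1.0


def analog_bits_decode(analog):
    """Threshold real-valued bits at zero and convert back to integer codes."""
    bits = (np.asarray(analog) > 0).astype(np.int64)
    num_bits = bits.shape[1]
    weights = 1 << np.arange(num_bits - 1, -1, -1)
    return bits @ weights


# Hypothetical categorical column with 5 categories, already integer-coded.
codes = np.array([0, 3, 4, 1])
one_hot = one_hot_encode(codes, 5)      # shape (4, 5)
analog = analog_bits_encode(codes, 5)   # shape (4, 3)
decoded = analog_bits_decode(analog)    # recovers the original codes
```

In this sketch, one-hot encoding uses one dimension per category, whereas analog bits need only about log2 of the number of categories, which keeps the input to a continuous-state generative model compact for high-cardinality variables.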