Diffusion-based tabular data synthesis models have yielded promising results. However, as data dimensionality increases, existing models tend to degenerate and may perform even worse than simpler, non-diffusion-based models. This is because the limited number of training samples in a high-dimensional space often prevents generative models from capturing the distribution accurately. To mitigate the insufficient learning signals and to stabilize training under such conditions, we propose CtrTab, a condition-controlled diffusion model that injects perturbed ground-truth samples as auxiliary inputs during training. This design introduces an implicit L2 regularization on the model's sensitivity to the control signal, improving robustness and stability in high-dimensional, low-data scenarios. Experimental results across multiple datasets show that CtrTab outperforms state-of-the-art models, with an average accuracy gap of over 90%.
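The core mechanism described above (a diffusion training step that conditions the denoiser on a noise-perturbed copy of the ground-truth sample) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the perturbation scale `sigma_c`, the epsilon-prediction loss, and the `denoise_fn` signature are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(x0, sigma_c=0.1):
    # Control signal: the ground-truth sample plus Gaussian noise.
    # sigma_c is a hypothetical perturbation scale, not a value from the paper.
    return x0 + sigma_c * rng.standard_normal(x0.shape)

def diffusion_training_step(x0, t, betas, denoise_fn):
    # Standard DDPM-style forward noising of x0 to timestep t.
    alphas_bar = np.cumprod(1.0 - betas)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    # Inject the perturbed ground-truth sample as an auxiliary control input.
    c = perturb(x0)
    # The denoiser is conditioned on both the noised sample and the control.
    eps_hat = denoise_fn(x_t, t, c)
    # Simple epsilon-prediction loss (one common diffusion objective).
    return float(np.mean((eps - eps_hat) ** 2))

# Toy usage with a stand-in "denoiser" (a fixed linear map, for illustration only).
betas = np.linspace(1e-4, 0.02, 100)
x0 = rng.standard_normal((4, 8))          # 4 tabular rows, 8 features
loss = diffusion_training_step(x0, 50, betas, lambda x, t, c: 0.5 * (x + c))
```

Intuitively, because the control signal is a noisy copy of the target, small input perturbations should not swing the output; penalizing that sensitivity is what yields the implicit L2 regularization effect the abstract mentions.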