Recent literature in self-supervised learning has demonstrated significant progress in closing the gap between supervised and unsupervised methods in the image and text domains. These methods rely on domain-specific augmentations that do not transfer directly to the tabular domain. Instead, we introduce Contrastive Mixup, a semi-supervised learning framework for tabular data, and demonstrate its effectiveness in limited annotated data settings. Our proposed method leverages Mixup-based augmentation under the manifold assumption by mapping samples to a low-dimensional latent space and encouraging interpolated samples to have high similarity within the same labeled class. Unlabeled samples are additionally employed via a transductive label propagation method to further enrich the set of similar and dissimilar pairs used in the contrastive loss term. We demonstrate the effectiveness of the proposed framework on public tabular datasets and real-world clinical datasets.
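The core idea described above can be sketched in a few lines: embed tabular samples into a latent space, apply Mixup interpolation to the embeddings, and use a contrastive objective that pulls interpolated samples toward embeddings sharing the dominant label. This is a minimal illustrative sketch, not the paper's exact architecture or loss; the encoder, the loss form, and all hyperparameters (`alpha`, `tau`) are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    # Hypothetical encoder: a single nonlinear layer mapping raw
    # tabular features to a low-dimensional latent space.
    return np.tanh(x @ W)

def latent_mixup(z, y, alpha=0.2):
    # Mixup in latent space: convexly interpolate random pairs of
    # embeddings; return both endpoint label vectors and lambda.
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(z))
    return lam * z + (1 - lam) * z[perm], y, y[perm], lam

def contrastive_loss(z_mix, z, y_anchor, y, tau=0.5):
    # Supervised-contrastive-style term: for each interpolated sample,
    # treat embeddings with the anchor's label as positives and all
    # others as negatives (illustrative form only).
    z_mix = z_mix / np.linalg.norm(z_mix, axis=1, keepdims=True)
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = np.exp(z_mix @ zn.T / tau)                  # pairwise similarities
    pos = (y_anchor[:, None] == y[None, :]).astype(float)
    return -np.mean(np.log((sim * pos).sum(1) / sim.sum(1)))

# Toy "tabular" batch: 8 samples, 5 features, 2 classes.
X = rng.normal(size=(8, 5))
y = rng.integers(0, 2, size=8)
W = rng.normal(size=(5, 3))       # encoder weights (untrained, for illustration)

z = encode(X, W)
z_mix, y_a, y_b, lam = latent_mixup(z, y)
loss = contrastive_loss(z_mix, z, y_a, y)
print(loss)
```

In the full semi-supervised setting, pseudo-labels produced by transductive label propagation over unlabeled samples would extend `y`, enlarging the pool of positive and negative pairs that this loss can draw on.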