Self-supervised learning has proven very effective at learning useful representations, yet much of its success has been achieved on data types such as images, audio, and text. This success is mainly enabled by exploiting spatial, temporal, or semantic structure in the data through augmentation. However, such structure may not exist in the tabular datasets commonly used in fields such as healthcare, which makes it difficult to design an effective augmentation method and hinders similar progress in the tabular data setting. In this paper, we introduce a new framework, Subsetting features of Tabular data (SubTab), that turns the task of learning from tabular data into a multi-view representation learning problem by dividing the input features into multiple subsets. We argue that reconstructing the data from a subset of its features, rather than from a corrupted version of it, in an autoencoder setting can better capture its underlying latent representation. In this framework, the joint representation can be expressed as the aggregate of the latent variables of the subsets at test time, which we refer to as collaborative inference. Our experiments show that SubTab achieves state-of-the-art (SOTA) performance of 98.31% on MNIST in the tabular setting, on par with CNN-based SOTA models, and surpasses existing baselines on three other real-world datasets by a significant margin.
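
To make the idea of feature subsetting and collaborative inference concrete, the following is a minimal sketch rather than the authors' implementation. It assumes equal-width, non-overlapping column blocks, a single shared encoder, and averaging as the aggregation step. The names `split_features`, `encode`, and `collaborative_inference` are hypothetical, and the encoder is an untrained random linear layer standing in for a trained autoencoder; the reconstruction objective used in training is omitted.

```python
import numpy as np

def split_features(x, n_subsets):
    """Split the columns of x into n_subsets equal-width, non-overlapping blocks.

    Assumption of this sketch: the feature count is divisible by n_subsets,
    so every block can be fed to the same encoder.
    """
    n_features = x.shape[1]
    if n_features % n_subsets != 0:
        raise ValueError("n_features must be divisible by n_subsets in this sketch")
    return np.split(x, n_subsets, axis=1)

def encode(subset, weights):
    """Stand-in encoder: one linear layer followed by tanh (untrained)."""
    return np.tanh(subset @ weights)

def collaborative_inference(x, weights, n_subsets):
    """Aggregate (here: average) the latent codes of all feature subsets."""
    latents = [encode(s, weights) for s in split_features(x, n_subsets)]
    return np.mean(latents, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 12))        # 5 rows, 12 tabular features
weights = rng.normal(size=(4, 3))   # shared encoder: 4 inputs -> 3 latent dims
z = collaborative_inference(x, weights, n_subsets=3)
print(z.shape)  # (5, 3)
```

Each row ends up with one joint representation built from three partial views of its features. Averaging is only one possible aggregate; other choices, such as summing or concatenating the subset latents, fit the same interface.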