We present \emph{TabRet}, a pre-trainable Transformer-based model for tabular data. TabRet is designed to work on downstream tasks that contain columns not seen during pre-training. Unlike other methods, TabRet has an extra learning step before fine-tuning, called \emph{retokenizing}, which calibrates feature embeddings based on the masked autoencoding loss. In experiments, we pre-trained TabRet on a large collection of public health surveys and fine-tuned it on classification tasks in healthcare, where it achieved the best AUC performance on four datasets. In addition, an ablation study shows that retokenizing and the random shuffling augmentation of columns during pre-training contributed to the performance gains. The code is available at https://github.com/pfnet-research/tabret.
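To make the retokenizing step concrete, here is a minimal PyTorch sketch of the underlying idea: a pre-trained encoder and reconstruction head are kept frozen, and only the tokenizers for the unseen downstream columns (plus a mask token, in this sketch) are trained with a masked-autoencoding loss. All names and settings here (ColumnTokenizer, the 50% mask ratio, the tiny two-layer encoder, training the mask token) are illustrative assumptions, not the paper's actual implementation; see the repository above for the real code.

```python
# Hypothetical sketch of retokenizing: calibrate tokenizers for unseen
# columns against a frozen pre-trained backbone via masked autoencoding.
import torch
import torch.nn as nn

class ColumnTokenizer(nn.Module):
    """Embeds one numeric column into the Transformer's token space."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(1, d_model)

    def forward(self, x):                      # x: (batch,)
        return self.proj(x.unsqueeze(-1))      # -> (batch, d_model)

d_model, n_new_cols, batch = 32, 4, 16

# Stand-ins for the pre-trained encoder and reconstruction head; both are
# frozen during retokenizing, so only the new tokenizers get updated.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
decoder = nn.Linear(d_model, 1)
for p in list(encoder.parameters()) + list(decoder.parameters()):
    p.requires_grad = False

tokenizers = nn.ModuleList([ColumnTokenizer(d_model) for _ in range(n_new_cols)])
mask_token = nn.Parameter(torch.zeros(d_model))
opt = torch.optim.Adam(list(tokenizers.parameters()) + [mask_token], lr=1e-3)

x = torch.randn(batch, n_new_cols)             # toy downstream table

for step in range(10):
    # Tokenize each downstream column, then randomly mask half of them.
    tokens = torch.stack([tok(x[:, j]) for j, tok in enumerate(tokenizers)], dim=1)
    mask = torch.rand(batch, n_new_cols) < 0.5
    tokens = torch.where(mask.unsqueeze(-1), mask_token.expand_as(tokens), tokens)

    # Reconstruct column values; the loss is taken only on masked positions.
    recon = decoder(encoder(tokens)).squeeze(-1)
    loss = ((recon - x) ** 2 * mask).sum() / mask.sum().clamp(min=1)
    opt.zero_grad(); loss.backward(); opt.step()
```

In this sketch, freezing the backbone means the masked-reconstruction gradient can only adjust the new column embeddings, which is the sense in which retokenizing "calibrates" them before fine-tuning.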