Tabular data (tables) is the most widely used data format in machine learning (ML). However, ML models often assume that the table structure stays fixed between training and testing. Before modeling, heavy data cleaning is required to merge disparate tables with different columns, and this preprocessing often wastes a significant amount of data (e.g., by removing unmatched columns and samples). How can ML models be learned from multiple tables with partially overlapping columns? How can they be incrementally updated as more columns become available over time? Can we leverage model pretraining on multiple distinct tables? How can we train an ML model that can predict on an unseen table? To answer these questions, we propose to relax the fixed-table-structure assumption by introducing a Transferable Tabular Transformer (TransTab) for tables. The goal of TransTab is to convert each sample (a row in the table) into a generalizable embedding vector and then apply stacked transformers for feature encoding. One methodological insight is to combine column descriptions and table cells as the raw input to a gated transformer model. The other is to introduce supervised and self-supervised pretraining to improve model performance. We compare TransTab with multiple baseline methods on diverse benchmark datasets and five oncology clinical trial datasets. Overall, TransTab ranks 1.00, 1.00, and 1.78 out of 12 methods in the supervised learning, feature incremental learning, and transfer learning scenarios, respectively, and the proposed pretraining leads to a 2.3% AUC lift on average over supervised learning.
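To make the idea of combining column descriptions with table cells concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simple text serialization in which each cell becomes a "column name + value" token, so rows from tables with different columns map into one shared token space. The function name `row_to_tokens`, the serialization rules, and the example rows are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Union

Cell = Union[str, int, float, bool, None]


def row_to_tokens(row: Dict[str, Cell]) -> List[str]:
    """Serialize one row into 'column name + cell value' text tokens.

    Assumed rules (illustrative only):
      - missing cells (None) are skipped,
      - boolean cells contribute the column name only when True,
      - all other cells contribute "<column name> <value>".
    """
    tokens: List[str] = []
    for column, value in row.items():
        if value is None:
            continue  # a missing cell contributes nothing
        name = column.replace("_", " ").lower()
        if isinstance(value, bool):  # check bool before other types
            if value:
                tokens.append(name)
        else:
            tokens.append(f"{name} {value}")
    return tokens


# Two made-up tables whose columns only partially overlap.
table_a = [{"age": 63, "gender": "female", "smoker": True}]
table_b = [{"age": 48, "tumor_stage": "II", "prior_chemo": False}]

for row in table_a + table_b:
    print(row_to_tokens(row))
```

Because every row becomes a variable-length token sequence rather than a fixed-width vector, no column has to be dropped to align the two tables. A downstream encoder (a gated transformer in the case of TransTab) would then embed these tokens.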