Recent work on deep learning for tabular data demonstrates the strong performance of deep tabular models, often bridging the gap between gradient boosted decision trees (GBDTs) and neural networks. Accuracy aside, a major advantage of neural models is that they learn reusable features and are easily fine-tuned in new domains. This property is often exploited in computer vision and natural language applications, where transfer learning is indispensable when task-specific training data is scarce. In this work, we demonstrate that upstream data gives tabular neural networks a decisive advantage over widely used GBDT models. We propose a realistic medical diagnosis benchmark for tabular transfer learning, and we present a how-to guide for using upstream data to boost performance with a variety of tabular neural network architectures. Finally, we propose a pseudo-feature method for cases in which the upstream and downstream feature sets differ, a tabular-specific problem that is widespread in real-world applications. Our code is available at https://github.com/LevinRoman/tabular-transfer-learning.
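The pretrain-on-upstream, fine-tune-on-downstream recipe and the pseudo-feature idea mentioned above can be illustrated with a minimal sketch. The snippet below is an assumption-laden toy example, not the pipeline from the linked repository: it uses a plain PyTorch MLP, synthetic tensors in place of the medical benchmark, and simple mean imputation as the pseudo-feature for a column that is present downstream but missing upstream; names such as `make_mlp` and the dataset sizes are hypothetical.

```python
# Minimal, illustrative sketch of tabular transfer learning with a pseudo-feature.
# Assumptions: synthetic data, a plain MLP, and mean imputation for the missing
# upstream column. This is NOT the authors' implementation.
import torch
import torch.nn as nn

N_UP, N_DOWN, D = 5000, 200, 10        # upstream/downstream sizes, shared feature count

def make_mlp(d_in: int, d_out: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                         nn.Linear(128, 128), nn.ReLU(),
                         nn.Linear(128, d_out))

# Synthetic stand-ins for real upstream/downstream tables.
X_up = torch.randn(N_UP, D)            # upstream table lacks one downstream feature
y_up = torch.randint(0, 2, (N_UP,))
X_down = torch.randn(N_DOWN, D + 1)    # downstream table has one extra feature
y_down = torch.randint(0, 2, (N_DOWN,))

# Pseudo-feature: append an imputed column (here, the downstream mean) to the
# upstream table so both stages see the same D + 1 input features.
pseudo_col = torch.full((N_UP, 1), X_down[:, -1].mean().item())
X_up_aug = torch.cat([X_up, pseudo_col], dim=1)

model = make_mlp(D + 1, 2)

def train(model, X, y, lr, epochs):
    """Full-batch training with cross-entropy loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()

train(model, X_up_aug, y_up, lr=1e-3, epochs=20)   # upstream pretraining
train(model, X_down, y_down, lr=1e-4, epochs=20)   # downstream fine-tuning
```

The smaller learning rate in the second call reflects the common practice of fine-tuning gently so that pretrained features are adapted rather than overwritten; the actual hyperparameters and architectures studied in the paper are documented in the repository.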