Recent deep learning models for tabular data now compete with traditional ML models based on gradient-boosted decision trees (GBDT). Unlike GBDT, deep models can additionally benefit from pretraining, which is the workhorse of DL for vision and NLP. For tabular problems, several pretraining methods have been proposed, but it is not entirely clear whether pretraining provides consistent, noticeable improvements and which method should be used, since the methods are often not compared to each other, or the comparisons are limited to the simplest MLP architectures. In this work, we aim to identify the best practices for pretraining tabular DL models that can be universally applied across different datasets and architectures. Among our findings, we show that using the target labels of the objects during the pretraining stage is beneficial for downstream performance, and we advocate several target-aware pretraining objectives. Overall, our experiments demonstrate that properly performed pretraining significantly increases the performance of tabular DL models, which often leads to their superiority over GBDT.
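To make the notion of a target-aware pretraining objective concrete, the sketch below shows one plausible instance under stated assumptions: a tabular backbone is pretrained to predict the downstream labels from randomly corrupted feature vectors, and fine-tuning then continues from those weights on clean inputs. The architecture, the `corrupt` helper, and all hyperparameters are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch (not the paper's exact method): target-aware pretraining
# for a tabular model. The objective is "target-aware" because the labels y
# are used already at the pretraining stage, unlike purely self-supervised
# objectives such as masked-feature reconstruction.
import torch
import torch.nn as nn

def corrupt(x: torch.Tensor, p: float = 0.3) -> torch.Tensor:
    """Replace a random subset of features with values taken from other
    rows in the batch (a common corruption scheme for tabular data)."""
    mask = torch.rand_like(x) < p
    shuffled = x[torch.randperm(x.size(0))]
    return torch.where(mask, shuffled, x)

# Stand-in backbone; any tabular DL architecture could be used here.
backbone = nn.Sequential(
    nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU()
)
head = nn.Linear(128, 1)  # binary-classification head

x = torch.randn(256, 32)                       # toy batch: 256 objects, 32 features
y = torch.randint(0, 2, (256, 1)).float()      # toy binary targets

loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(
    list(backbone.parameters()) + list(head.parameters()), lr=1e-3
)

# Pretraining stage: predict the downstream target from corrupted inputs.
for step in range(100):
    opt.zero_grad()
    logits = head(backbone(corrupt(x)))
    loss = loss_fn(logits, y)
    loss.backward()
    opt.step()

# Fine-tuning would then proceed from these backbone weights on clean
# inputs, optionally with a freshly initialized prediction head.
```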