While deep learning has enabled tremendous progress on text and image datasets, its superiority on tabular data is not clear. We contribute extensive benchmarks of standard and novel deep learning methods as well as tree-based models such as XGBoost and Random Forests, across a large number of datasets and hyperparameter combinations. We define a standard set of 45 datasets from varied domains with clear characteristics of tabular data, and a benchmarking methodology accounting for both fitting models and finding good hyperparameters. Results show that tree-based models remain state-of-the-art on medium-sized data ($\sim$10K samples) even without accounting for their superior speed. To understand this gap, we conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs). This leads to a series of challenges which should guide researchers aiming to build tabular-specific NNs: 1. be robust to uninformative features, 2. preserve the orientation of the data, and 3. be able to easily learn irregular functions. To stimulate research on tabular architectures, we contribute a standard benchmark and raw data for baselines: every point of a 20 000-compute-hour hyperparameter search for each learner.
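The benchmark compares tree-based learners and neural networks under a shared random hyperparameter-search budget. The following is a minimal illustrative sketch of that kind of comparison, not the paper's actual benchmark code: the dataset, search spaces, and budgets below are placeholder assumptions chosen only to show the structure of the experiment.

```python
# Illustrative sketch (assumed setup, not the paper's benchmark): compare a
# tree-based model (XGBoost) against a simple neural network (scikit-learn MLP)
# on one medium-sized tabular dataset, using a small random hyperparameter
# search for each learner.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

# A medium-sized (~20K samples) regression dataset, used here as a stand-in
# for the paper's 45-dataset suite.
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Placeholder search spaces; the real benchmark explores far larger budgets.
searches = {
    "xgboost": RandomizedSearchCV(
        XGBRegressor(),
        {"n_estimators": [100, 300, 1000],
         "max_depth": [3, 6, 9],
         "learning_rate": [0.01, 0.1, 0.3]},
        n_iter=10, cv=3, random_state=0,
    ),
    "mlp": RandomizedSearchCV(
        make_pipeline(StandardScaler(), MLPRegressor(max_iter=500)),
        {"mlpregressor__hidden_layer_sizes": [(64,), (256,), (256, 256)],
         "mlpregressor__learning_rate_init": [1e-4, 1e-3, 1e-2]},
        n_iter=9, cv=3, random_state=0,
    ),
}

for name, search in searches.items():
    search.fit(X_train, y_train)
    print(f"{name}: best CV score = {search.best_score_:.3f}, "
          f"test R^2 = {search.score(X_test, y_test):.3f}")
```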