Deep Neural Networks and Tabular Data: A Survey

Heterogeneous tabular data are the most commonly used form of data and are essential for numerous critical and computationally demanding applications. On homogeneous data sets, deep neural networks have repeatedly shown excellent performance and have therefore been widely adopted. However, their adaptation to tabular data for inference or data generation tasks remains challenging. To facilitate further progress in the field, this work provides an overview of state-of-the-art deep learning methods for tabular data. We categorize these methods into three groups: data transformations, specialized architectures, and regularization models. For each of these groups, our work offers a comprehensive overview of the main approaches. Moreover, we discuss deep learning approaches for generating tabular data, and we also provide an overview of strategies for explaining deep models on tabular data. Thus, our first contribution is to address the main research streams and existing methodologies in the mentioned areas, while highlighting relevant challenges and open research questions. Our second contribution is to provide an empirical comparison of traditional machine learning methods with eleven deep learning approaches across five popular real-world tabular data sets of different sizes and with different learning objectives. Our results, which we have made publicly available as competitive benchmarks, indicate that algorithms based on gradient-boosted tree ensembles still mostly outperform deep learning models on supervised learning tasks, suggesting that the research progress on competitive deep learning models for tabular data is stagnating. To the best of our knowledge, this is the first in-depth overview of deep learning approaches for tabular data; as such, this work can serve as a valuable starting point to guide researchers and practitioners interested in deep learning with tabular data.
