Deep learning methods in the literature are invariably benchmarked on image data sets and then assumed to work on all data problems. Unfortunately, architectures designed for image learning are often neither ready nor optimal for non-image data unless data-specific learning requirements are taken into account. In this paper, we take a data-centric view to argue that deep embedding clustering methods developed for images are not equally effective on heterogeneous tabular data sets. This paper performs one of the first studies on deep embedding clustering of seven tabular data sets using six state-of-the-art baseline methods proposed for image data sets. Our results reveal that traditional clustering of tabular data ranks second out of eight methods and is superior to most deep embedding clustering baselines. This observation is in line with the recent literature showing that traditional machine learning on tabular data remains competitive with deep learning. Although this may surprise many deep learning researchers, traditional clustering methods can be competitive baselines for tabular data, and outperforming these baselines remains a challenge for deep embedding clustering. Therefore, deep learning methods for image learning may not be fair or suitable baselines for tabular data unless data-specific contrasts and learning requirements are considered.