Deep learning methods in the literature are commonly benchmarked on image data sets, which may not be suitable or effective benchmarks for non-image tabular data. In this paper, we take a data-centric view and present one of the first studies of deep embedding clustering on tabular data. Eight methods, comprising traditional clustering and state-of-the-art embedding clustering methods proposed for image data sets, are tested on seven tabular data sets. Our results reveal that a traditional clustering method ranks second out of the eight methods and outperforms most deep embedding clustering baselines. This observation agrees with the literature, in which conventional machine learning remains a robust alternative to deep learning on tabular data. Therefore, state-of-the-art embedding clustering methods should consider data-centric customization of their learning architectures to become competitive baselines for tabular data.