The latent space of autoencoders has been improved for clustering image data by jointly learning a t-distributed embedding with a clustering algorithm, an approach inspired by the neighborhood embedding concept proposed for data visualization. However, multivariate tabular data pose different challenges in representation learning than image data, and traditional machine learning often outperforms deep learning on tabular data. In this paper, we address the challenges of learning tabular data in contrast to image data and present a novel Gaussian Cluster Embedding in Autoencoder Latent Space (G-CEALS) algorithm that replaces t-distributions with multivariate Gaussian clusters. Unlike current methods, the proposed approach defines the Gaussian embedding and the target cluster distribution independently, so that any clustering algorithm can be accommodated in representation learning. A trained G-CEALS model extracts a quality embedding for unseen test data. Based on embedding clustering accuracy, the average rank of the proposed G-CEALS method is 1.4 (0.7), which is superior to all eight baseline clustering and cluster embedding methods on seven tabular data sets. This paper presents one of the first algorithms to jointly learn embedding and clustering to improve multivariate tabular data representation for downstream clustering.
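To make the contrast between a t-distributed embedding and multivariate Gaussian clusters concrete, the following is a minimal NumPy sketch of the two kinds of soft cluster assignment on latent points. It is an illustration under stated assumptions, not the G-CEALS implementation: the abstract does not give the exact formulas, so the Student's t kernel follows the common form used in t-distributed cluster embedding, the Gaussian version simply normalizes per-cluster Gaussian densities, and the function names, toy data, and parameters (`t_soft_assign`, `gaussian_soft_assign`, `alpha`, `means`, `covs`) are hypothetical.

```python
import numpy as np


def t_soft_assign(z, centroids, alpha=1.0):
    """Soft assignments from a Student's t kernel (assumed form)."""
    # Squared Euclidean distance between every point and every centroid: (n, k)
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)


def gaussian_soft_assign(z, means, covs):
    """Soft assignments from multivariate Gaussian clusters (assumed form).

    Each cluster j has its own mean and full covariance; assignments are the
    Gaussian densities normalized over clusters (equal cluster weights).
    """
    n, d = z.shape
    k = means.shape[0]
    log_p = np.empty((n, k))
    for j in range(k):
        diff = z - means[j]
        inv_cov = np.linalg.inv(covs[j])
        # Squared Mahalanobis distance of each point to cluster j
        maha = np.einsum("ni,ij,nj->n", diff, inv_cov, diff)
        _, logdet = np.linalg.slogdet(covs[j])
        log_p[:, j] = -0.5 * (maha + logdet + d * np.log(2.0 * np.pi))
    # Subtract the row-wise maximum before exponentiating for numerical stability
    log_p -= log_p.max(axis=1, keepdims=True)
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)


# Toy latent points: two well-separated blobs in a 2-D latent space (made-up data)
rng = np.random.default_rng(0)
z = np.vstack([rng.normal(-2.0, 0.5, (50, 2)), rng.normal(2.0, 0.5, (50, 2))])
means = np.array([[-2.0, -2.0], [2.0, 2.0]])
covs = np.stack([0.25 * np.eye(2), 0.25 * np.eye(2)])

q_t = t_soft_assign(z, means)
q_g = gaussian_soft_assign(z, means, covs)
```

Both functions return an `(n, k)` matrix whose rows sum to one. In a joint training setup, a matrix of this kind would typically be compared against a target cluster distribution through a divergence term added to the autoencoder reconstruction loss; how G-CEALS defines that target is not specified in the abstract and is left out of the sketch.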