High-dimensional data analysis for exploration and discovery involves three fundamental tasks: dimensionality reduction, clustering, and visualization. When the three tasks are performed separately, as has usually been the case, inconsistencies can arise among them in data geometry and other respects, which can lead to confusing or misleading interpretations of the data. In this paper, we propose a novel neural network-based method, called Consistent Representation Learning (CRL), that accomplishes the three tasks end-to-end and improves their consistency. The CRL network consists of two nonlinear dimensionality reduction (NLDR) transformations: (1) one from the input data space to the latent feature space used for clustering, and (2) another from the clustering space to the final 2D or 3D space used for visualization. Importantly, the two NLDR transformations are trained to best satisfy local geometry preserving (LGP) constraints across the spaces, that is, across network layers, so that data consistency improves along the processing flow. We also propose a novel metric, clustering-visualization inconsistency (CVI), for evaluating such inconsistencies. Extensive comparative results show that the proposed CRL method outperforms popular t-SNE-based and UMAP-based methods, as well as other contemporary clustering and visualization algorithms, in terms of both evaluation metrics and visualization.
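The two-stage structure described above can be illustrated with a minimal NumPy sketch. It is not the CRL implementation: the abstract does not specify the form of the LGP constraint, so `lgp_loss` below is a hypothetical stand-in that penalizes changes in the distance from each point to its k nearest neighbours between one space and the next. The layer sizes, the `tanh` layers, the untrained random weights, and all function names are likewise assumptions made for illustration.

```python
import numpy as np


def pairwise_dist(x):
    """Euclidean distance matrix between the rows of x."""
    sq = np.sum(x * x, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from rounding


def knn_mask(d, k):
    """Boolean mask marking each row's k nearest neighbours (self excluded)."""
    n = d.shape[0]
    d = d + np.diag(np.full(n, np.inf))  # push self-distance to the end
    idx = np.argsort(d, axis=1)[:, :k]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.arange(n)[:, None], idx] = True
    return mask


def lgp_loss(x_in, x_out, k=5):
    """Hypothetical local-geometry penalty (an assumption, not the paper's):
    mean squared change in distance to the k nearest neighbours of x_in."""
    d_in = pairwise_dist(x_in)
    d_out = pairwise_dist(x_out)
    mask = knn_mask(d_in, k)
    return float(np.mean((d_in[mask] - d_out[mask]) ** 2))


def nldr_layer(x, w):
    """One nonlinear map; a stand-in for a trained NLDR transformation."""
    return np.tanh(x @ w)


rng = np.random.default_rng(0)
x = rng.normal(size=(100, 20))            # input data space
w1 = rng.normal(scale=0.3, size=(20, 8))  # untrained weights, illustration only
w2 = rng.normal(scale=0.3, size=(8, 2))

z = nldr_layer(x, w1)  # (1) input space -> latent space used for clustering
y = nldr_layer(z, w2)  # (2) clustering space -> 2D space used for visualization

# Sum the local-geometry penalties across both transitions.
total = lgp_loss(x, z) + lgp_loss(z, y)
print(f"LGP penalty (input->latent + latent->2D): {total:.4f}")
```

In a trained model, a penalty of this kind would be minimized jointly with a clustering objective, so that neighbourhoods seen in the 2D plot stay close to those used for clustering. The sketch only evaluates the penalty on random weights.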