高度数据分析持续代表性学习 (Consistent Representation Learning for High Dimensional Data Analysis)

High dimensional data analysis for exploration and discovery includes three fundamental tasks: dimensionality reduction, clustering, and visualization. When the three associated tasks are done separately, as is often the case thus far, inconsistencies can occur among the tasks in terms of data geometry and others. This can lead to confusing or misleading data interpretation. In this paper, we propose a novel neural network-based method, called Consistent Representation Learning (CRL), to accomplish the three associated tasks end-to-end and improve the consistencies. The CRL network consists of two nonlinear dimensionality reduction (NLDR) transformations: (1) one from the input data space to the latent feature space for clustering, and (2) the other from the clustering space to the final 2D or 3D space for visualization. Importantly, the two NLDR transformations are performed to best satisfy local geometry preserving (LGP) constraints across the spaces or network layers, to improve data consistencies along with the processing flow. Also, we propose a novel metric, clustering-visualization inconsistency (CVI), for evaluating the inconsistencies. Extensive comparative results show that the proposed CRL neural network method outperforms the popular t-SNE and UMAP-based and other contemporary clustering and visualization algorithms in terms of evaluation metrics and visualization.

翻译：用于勘探和发现的高维数据分析包括三项基本任务:维维度减少、集群和可视化。当三项相关任务分别执行时,正如目前经常发生的情况一样,在数据几何和其他方面的任务之间可能会出现不一致。这可能导致数据解释的混乱或误导。在本文中,我们建议采用一种新的神经网络方法,称为“一致代表学习”,以完成三项相关任务,端至端至端,并改进整体性。CRL网络包括两种非线性维度减少(NLDR)转换:(1)从输入数据空间到组合的潜在特征空间,以及(2)从组合空间到最后2D或3D空间,以可视化。重要的是,为了最好地满足空间或网络层的当地几度保持(LGP)限制,我们提出了一种新型的衡量、集群-可视化(CVI),以评价不一致之处。广泛的比较结果显示,拟议的CRL神经网络现代化和直观数据采集模型方法超越了ULMARMU和视觉模型的其他通用方法。

相关内容

表示学习

关注 186

表示学习是通过利用训练数据来学习得到向量表示，这可以克服人工方法的局限性。表示学习通常可分为两大类，无监督和有监督表示学习。大多数无监督表示学习方法利用自动编码器（如去噪自动编码器和稀疏自动编码器等）中的隐变量作为表示。目前出现的变分自动编码器能够更好的容忍噪声和异常值。然而，推断给定数据的潜在结构几乎是不可能的。目前有一些近似推断的策略。此外，一些无监督表示学习方法旨在近似某种特定的相似性度量。提出了一种无监督的相似性保持表示学习框架，该框架使用矩阵分解来保持成对的DTW相似性。通过学习保持DTW的shaplets，即在转换后的空间中的欧式距离近似原始数据的真实DTW距离。有监督表示学习方法可以利用数据的标签信息，更好地捕获数据的语义结构。孪生网络和三元组网络是目前两种比较流行的模型，它们的目标是最大化类别之间的距离并最小化了类别内部的距离。

【ETH】最新《几何数据分析》2020课程，附PPT下载

专知会员服务

45+阅读 · 2020年12月18日

【干货书】机器学习速查手册，135页pdf

专知会员服务

127+阅读 · 2020年11月20日

数据科学导论，54页ppt，Introduction to Data Science

专知会员服务

42+阅读 · 2020年7月27日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

81+阅读 · 2020年7月26日