There has been intense recent activity in the embedding of very high-dimensional and nonlinear data structures, much of it in the data science and machine learning literature. We survey this activity in four parts. In the first part we cover nonlinear methods such as principal curves, multidimensional scaling, local linear methods, ISOMAP, graph-based methods and kernel-based methods. The second part is concerned with topological embedding methods, in particular the mapping of topological properties into persistence diagrams. Another type of data set that has seen tremendous growth is very high-dimensional network data. The task considered in part three is how to embed such data in a vector space of moderate dimension, so that the data become amenable to traditional techniques such as clustering and classification. The final part of the survey deals with embedding in $\mathbb{R}^2$, that is, visualization. Three methods are presented: $t$-SNE, UMAP and LargeVis, based on methods in parts one, two and three, respectively. The methods are illustrated and compared on two simulated data sets: one consisting of a triple of noisy Ranunculoid curves, and one consisting of networks of increasing complexity with two types of nodes.