There has been intense recent activity in the embedding of very high-dimensional and nonlinear data structures, much of it in the data science and machine learning literature. We survey this activity in four parts. The first part covers nonlinear methods such as principal curves, multidimensional scaling, local linear methods, ISOMAP, graph-based methods and diffusion mapping, kernel-based methods, and random projections. The second part is concerned with topological embedding methods, in particular mapping topological properties into persistence diagrams and the Mapper algorithm. Another type of data set that has seen tremendous growth is very high-dimensional network data. The task considered in the third part is how to embed such data in a vector space of moderate dimension so that the data become amenable to traditional techniques such as clustering and classification. Arguably, this is the part where the contrast between algorithmic machine learning methods and statistical modeling, the so-called stochastic block modeling, is at its greatest. We discuss the pros and cons of the two approaches. The final part of the survey deals with embedding in $\mathbb{R}^2$, i.e., visualization. Three methods are presented: $t$-SNE, UMAP, and LargeVis, based on methods in parts one, two, and three, respectively. The methods are illustrated and compared on two simulated data sets: one consisting of a triplet of noisy Ranunculoid curves, and one consisting of networks of increasing complexity generated with stochastic block models and with two types of nodes.
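As a rough illustration of the first simulated data set, the sketch below samples three noisy Ranunculoid curves using the standard parametrization $x(t) = 6\cos t - \cos 6t$, $y(t) = 6\sin t - \sin 6t$. The abstract does not specify how the triplet is constructed, so the function names `noisy_ranunculoid` and `ranunculoid_triplet`, the scale factors, the noise level, and the nested arrangement are all assumptions made for illustration only.

```python
import math
import random


def noisy_ranunculoid(n_points, scale=1.0, noise_sd=0.1, seed=0):
    """Sample points from a ranunculoid (5-cusped epicycloid) plus Gaussian noise.

    `scale`, `noise_sd`, and `seed` are illustrative assumptions, not values
    taken from the survey.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n_points):
        t = rng.uniform(0.0, 2.0 * math.pi)
        # Standard ranunculoid parametrization, multiplied by `scale`.
        x = scale * (6.0 * math.cos(t) - math.cos(6.0 * t))
        y = scale * (6.0 * math.sin(t) - math.sin(6.0 * t))
        # Independent Gaussian noise on each coordinate.
        points.append((x + rng.gauss(0.0, noise_sd),
                       y + rng.gauss(0.0, noise_sd)))
    return points


def ranunculoid_triplet(n_per_curve=200, scales=(1.0, 2.0, 3.0),
                        noise_sd=0.1, seed=0):
    """Stack three noisy ranunculoids of different sizes (assumed layout)."""
    data, labels = [], []
    for label, scale in enumerate(scales):
        data.extend(noisy_ranunculoid(n_per_curve, scale, noise_sd, seed + label))
        labels.extend([label] * n_per_curve)
    return data, labels


if __name__ == "__main__":
    data, labels = ranunculoid_triplet()
    print(len(data), "points in", len(set(labels)), "curves")
```

The returned labels record which curve each point came from, which is what one would use to judge whether an embedding or visualization method keeps the three curves apart.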