Topology-based dimensionality reduction methods such as t-SNE and UMAP have become increasingly successful and popular for high-dimensional data. These methods have strong mathematical foundations and rest on the intuition that the topology in low dimensions should be close to that in high dimensions. Because the initial topological structure is a precursor to the success of the algorithm, this naturally raises the question: what makes a "good" topological structure for dimensionality reduction? Insight into this will enable us to design better algorithms that take into account both local and global structure. In this paper, which focuses on UMAP, we study the effects of node connectivity (k-Nearest Neighbors vs. \textit{mutual} k-Nearest Neighbors) and relative neighborhood (Adjacent via Path Neighbors) on dimensionality reduction. We explore these concepts through extensive ablation studies on four standard image and text datasets (MNIST, FMNIST, 20NG, and AG), reducing to 2 and 64 dimensions. Our findings indicate that a more refined notion of connectivity (\textit{mutual} k-Nearest Neighbors with a minimum spanning tree), together with a flexible method of constructing the local neighborhood (Path Neighbors), can achieve a much better representation than default UMAP, as measured by downstream clustering performance.
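To make the connectivity notions above concrete, the following is a minimal sketch, not the paper's implementation. It assumes Euclidean distance, a tiny hand-made point set, and a minimum spanning tree computed over the complete distance graph; the function names (`knn_sets`, `mutual_knn_edges`, `mst_edges`, `count_components`) are hypothetical and chosen only for illustration. It shows that a mutual k-NN graph keeps only edges that both endpoints agree on, which can leave isolated points or components, and that adding minimum spanning tree edges reconnects them.

```python
import math
from itertools import combinations

def knn_sets(points, k):
    """For each point index, return the set of its k nearest neighbors (Euclidean)."""
    n = len(points)
    neighbors = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        neighbors.append(set(order[:k]))
    return neighbors

def knn_edges(neighbors):
    """Undirected k-NN graph: keep an edge if either endpoint lists the other."""
    return {(min(i, j), max(i, j)) for i, nbrs in enumerate(neighbors) for j in nbrs}

def mutual_knn_edges(neighbors):
    """Mutual k-NN graph: keep an edge only if both endpoints list each other."""
    return {(i, j) for i, nbrs in enumerate(neighbors)
            for j in nbrs if i < j and i in neighbors[j]}

def mst_edges(points):
    """Minimum spanning tree of the complete Euclidean graph (Kruskal, union-find)."""
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = set()
    all_pairs = sorted(combinations(range(len(points)), 2),
                       key=lambda e: math.dist(points[e[0]], points[e[1]]))
    for i, j in all_pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.add((i, j))
    return tree

def count_components(n, edges):
    """Number of connected components of an undirected graph on n nodes."""
    adjacency = {i: set() for i in range(n)}
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, components = set(), 0
    for start in range(n):
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(adjacency[node] - seen)
    return components

# Illustrative data (an assumption): two tight clusters and one point between them.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
neighbors = knn_sets(points, k=2)
knn = knn_edges(neighbors)
mutual = mutual_knn_edges(neighbors)
connected = mutual | mst_edges(points)  # mutual k-NN edges plus MST edges
```

In this toy example the in-between point is dropped from the mutual k-NN graph because neither cluster lists it as a neighbor, and the spanning tree edges restore a single connected graph. How the paper weights or selects the added edges is not specified in the abstract, so that part of the sketch is an assumption.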