We discuss topological aspects of cluster analysis and show that inferring the topological structure of a dataset before clustering it can considerably enhance cluster detection. Theoretical arguments and empirical evidence show that it is highly beneficial to cluster embedding vectors that represent the structure of the data manifold rather than the observed feature vectors themselves. To demonstrate this, we combine the manifold learning method UMAP, which infers the topological structure, with the density-based clustering method DBSCAN. Results on synthetic and real data show that this combination both simplifies and improves clustering across a diverse set of low- and high-dimensional problems, including clusters of varying density and/or entangled shapes. The approach simplifies clustering because topological pre-processing consistently reduces the parameter sensitivity of DBSCAN. Clustering the resulting embeddings with DBSCAN can then even outperform complex methods such as SPECTACL and ClusterGAN. Finally, our investigation suggests that the crucial issue in clustering appears to be neither the nominal dimension of the data nor the number of irrelevant features it contains, but rather how \textit{separable} the clusters are in the ambient observation space in which they are embedded, usually the (high-dimensional) Euclidean space defined by the features of the data. Our approach is successful because we perform the cluster analysis after projecting the data into a more suitable space that is optimized for separability, in some sense.