Clustering and visualizing high-dimensional (HD) data are important tasks in a variety of fields. For example, in bioinformatics, they are crucial for analyses of single-cell data such as mass cytometry (CyTOF) data. Some of the most effective algorithms for clustering HD data are based on representing the data by nodes in a graph, with edges connecting neighbouring nodes according to some measure of similarity or distance. However, users of graph-based algorithms are typically faced with the critical but challenging task of choosing the value of an input parameter that sets the size of neighbourhoods in the graph, e.g. the number of nearest neighbours to which to connect each node or a threshold distance for connecting nodes. The burden on the user could be alleviated by a measure of inter-node similarity that can have value 0 for dissimilar nodes without requiring any user-defined parameters or thresholds. This would determine the neighbourhoods automatically while still yielding a sparse graph. To this end, I propose a new method called ASTRICS to measure similarity between clusters of HD data points based on local dimensionality reduction and triangulation of critical alpha shapes. I show that my ASTRICS similarity measure can facilitate both clustering and visualization of HD data by using it in Stage 2 of a three-stage pipeline: Stage 1 = perform an initial clustering of the data by any method; Stage 2 = let graph nodes represent initial clusters instead of individual data points and use ASTRICS to automatically define edges between nodes; Stage 3 = use the graph for further clustering and visualization. This trades the critical task of choosing a graph neighbourhood size for the easier task of essentially choosing a resolution at which to view the data. The graph and consequently downstream clustering and visualization are then automatically adapted to the chosen resolution.
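The three-stage pipeline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the ASTRICS method: Stage 2 uses a hypothetical stand-in similarity (`mutual_vote_similarity`, a name introduced here) that is also parameter-free and exactly 0 for most cluster pairs, but it does not perform local dimensionality reduction or alpha-shape triangulation. The toy data, the deterministic k-means initialisation, and the use of connected components in Stage 3 are likewise assumptions chosen to keep the example small and easy to verify.

```python
import math
from collections import defaultdict

def kmeans(points, k, iters=20):
    """Stage 1: initial clustering (any method works; plain k-means here)."""
    step = len(points) // k
    centers = points[::step][:k]  # deterministic init, chosen for reproducibility
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels

def mutual_vote_similarity(points, labels):
    """Stage 2 stand-in (NOT ASTRICS): each point votes for the cluster that
    owns its nearest point outside its own cluster. Two clusters get a
    non-zero similarity only if votes flow in both directions, so most
    pairs score exactly 0 without any user-chosen threshold."""
    sizes = defaultdict(int)
    votes = defaultdict(int)
    for i, p in enumerate(points):
        sizes[labels[i]] += 1
        others = [j for j in range(len(points)) if labels[j] != labels[i]]
        if others:
            nearest = min(others, key=lambda j: math.dist(p, points[j]))
            votes[(labels[i], labels[nearest])] += 1
    sim = {}
    for (a, b), v in votes.items():
        back = votes.get((b, a), 0)
        if a < b and back > 0:
            sim[(a, b)] = (v / sizes[a]) * (back / sizes[b])
    return sim

def connected_components(nodes, edges):
    """Stage 3 (one simple option): merge nodes joined by any edge."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)
    return {n: find(n) for n in nodes}

# Toy data: two well-separated line segments embedded in 5 dimensions.
blob_a = [(float(t),) * 5 for t in range(30)]
blob_b = [(float(t),) * 5 for t in range(1000, 1030)]
points = blob_a + blob_b

labels = kmeans(points, k=6)                    # Stage 1: six initial clusters
sim = mutual_vote_similarity(points, labels)    # Stage 2: sparse cluster graph
groups = connected_components(sorted(set(labels)), sim)  # Stage 3
final = [groups[lab] for lab in labels]

print(sorted(sim))                 # edges of the cluster graph
print(len(set(final)), "final groups")
```

The only choice left to the user in this sketch is `k`, the number of initial clusters, which plays the role of the resolution at which the data are viewed; the graph edges follow from the data and the initial clustering alone.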