\textbf{A}ccuracy, \textbf{R}obustness to noise and scale, \textbf{I}nterpretability, \textbf{S}peed, and \textbf{E}ase of use (ARISE) are crucial requirements of a good clustering algorithm. However, achieving all of these goals simultaneously is challenging, and most advanced approaches focus on only some of them. To address these aspects jointly, we propose a novel clustering algorithm, GIT (Clustering Based on \textbf{G}raph of \textbf{I}ntensity \textbf{T}opology). GIT considers both local and global data structures: it first forms local clusters based on the intensity peaks of samples, and then estimates the global topological graph (topo-graph) between these local clusters. We use the Wasserstein distance between the predicted and prior class proportions to automatically cut noisy edges in the topo-graph and merge connected local clusters into final clusters. We compare GIT with seven competing algorithms on five synthetic datasets and nine real-world datasets. With fast local cluster detection, robust topo-graph construction, and accurate edge-cutting, GIT shows attractive ARISE performance and significantly exceeds other non-convex clustering methods. For example, GIT outperforms its counterparts by about $10\%$ (F1-score) on MNIST and FashionMNIST. Code is available at \color{red}{https://github.com/gaozhangyang/GIT}.
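The edge-cutting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`proportions_distance`, `merge_clusters`, `cut_edges`), the rank-based form of the Wasserstein distance, and the exhaustive threshold search are all assumptions made for illustration. Local clusters are given as sizes, the topo-graph as weighted edges, and the sketch keeps the threshold whose merged clusters best match the prior class proportions.

```python
from itertools import accumulate


def proportions_distance(pred, prior):
    """1-D Wasserstein-1 distance between two proportion vectors.

    Assumption: both vectors are sorted in descending order and compared
    by rank, with the shorter one padded with zeros.
    """
    n = max(len(pred), len(prior))
    p = sorted(pred, reverse=True) + [0.0] * (n - len(pred))
    q = sorted(prior, reverse=True) + [0.0] * (n - len(prior))
    # In one dimension, W1 is the sum of absolute CDF differences.
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def merge_clusters(n_clusters, edges, threshold):
    """Union-find merge of local clusters joined by edges >= threshold."""
    parent = list(range(n_clusters))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for u, v, w in edges:
        if w >= threshold:
            parent[find(u)] = find(v)
    return [find(i) for i in range(n_clusters)]


def cut_edges(sizes, edges, prior):
    """Pick the edge-weight threshold whose merged clusters have class
    proportions closest to `prior`. Returns (distance, threshold, labels)."""
    total = sum(sizes)
    best = None
    # Hypothetical search: try every distinct edge weight, plus "cut all".
    candidates = sorted({w for _, _, w in edges}) + [float("inf")]
    for t in candidates:
        labels = merge_clusters(len(sizes), edges, t)
        mass = {}
        for lab, size in zip(labels, sizes):
            mass[lab] = mass.get(lab, 0) + size
        pred = [m / total for m in mass.values()]
        d = proportions_distance(pred, prior)
        if best is None or d < best[0]:
            best = (d, t, labels)
    return best


# Toy topo-graph: four local clusters, two strong edges, one weak (noisy) edge.
sizes = [30, 20, 25, 25]
edges = [(0, 1, 0.9), (2, 3, 0.8), (1, 2, 0.1)]
distance, threshold, labels = cut_edges(sizes, edges, prior=[0.5, 0.5])
```

In this toy input, the weak edge between local clusters 1 and 2 is the one that gets cut, leaving two final clusters of equal mass that match the prior exactly.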