Clustering is a commonplace problem in many areas of data science, with applications in biology and bioinformatics, understanding chemical structure, image segmentation, building recommender systems, and many more fields. While there are many different clustering variants (based on given distance or graph structure, probability distributions, or data density), we consider here the problem of clustering nodes in a graph, motivated by the problem of aggregating discrete degrees of freedom in multigrid and domain decomposition methods for solving sparse linear systems. Specifically, we consider the challenge of forming balanced clusters in the graph of a sparse matrix for use in algebraic multigrid, although the algorithm has general applicability. Based on an extension of the Bellman-Ford algorithm, we generalize Lloyd's algorithm for partitioning subsets of Rn to balance the number of nodes in each cluster; this is accompanied by a rebalancing algorithm that reduces the overall energy in the system. The algorithm provides control over the number of clusters and leads to "well centered" partitions of the graph. Theoretical results are provided to establish linear complexity and numerical results in the context of algebraic multigrid highlight the benefits of improved clustering.
翻译:在数据科学的许多领域,集群是一个常见的问题,在生物和生物信息学、理解化学结构、图像分割、建筑建议系统等应用领域和更多的领域都存在。虽然有许多不同的集群变体(根据给定的距离或图形结构、概率分布或数据密度),但我们在此考虑在图表中将节点分组的问题,其动机是多格和域分解方法中分解自由度,以解决稀薄线性系统。具体地说,我们考虑在用于代数多格的稀薄矩阵图中形成平衡的集群的挑战,尽管算法具有一般适用性。根据贝尔曼-福德算法的扩展,我们普遍采用劳埃德对Rn子分组进行分区分配的算法,以平衡每个组节点的数目;与此同时,还采用再平衡算法,减少系统的总体能量。算法对组数进行控制,并导致图形的“中枢”分区。提供了理论结果,以便在高格多格内格中确定线性的复杂性和数字结果,突出改进的组合的好处。</s>