This article explores and analyzes the unsupervised clustering of large partially observed graphs. We propose a scalable and provable randomized framework for clustering graphs generated from the stochastic block model. The clustering is first applied to a sub-matrix of the graph's adjacency matrix associated with a reduced graph sketch constructed using random sampling. Then, the clusters of the full graph are inferred based on the clusters extracted from the sketch using a correlation-based retrieval step. Uniform random node sampling is shown to improve the computational complexity over clustering of the full graph when the cluster sizes are balanced. A new random degree-based node sampling algorithm is presented which significantly improves upon the performance of the clustering algorithm even when clusters are unbalanced. This framework improves the phase transitions for matrix-decomposition-based clustering with regard to computational complexity and minimum cluster size, which are shown to be nearly dimension-free in the low inter-cluster connectivity regime. A third sampling technique is shown to improve balance by randomly sampling nodes based on spatial distribution. We provide analysis and numerical results using a convex clustering algorithm based on matrix completion.
翻译:本条探讨并分析了未经监督的大型部分观测图集群集。 我们为从随机区块模型中生成的群集图提出了可缩放和可变随机框架。 群集首先应用于图形相邻矩阵的子矩阵, 与通过随机抽样构建的缩小图形草图有关。 然后, 整张图群集根据利用基于相关性的检索步骤从草图中提取的群集推断出来。 统一的随机节点抽样显示, 以便在群集大小平衡的情况下, 提高整张图群集的计算复杂性。 新的随机度节点抽样算法显示, 即使在群集不平衡的情况下, 也会大大改进集算法的性能。 这个框架改进了基于计算复杂性和最小群集大小的矩阵脱钩群集的阶段过渡, 显示在低组群集连通制度下几乎没有维度。 第三个抽样技术显示, 通过基于空间分布的随机取样节点来改善平衡。 我们利用基于矩阵完成的等式组合群集算法, 提供了分析和数字结果。