Clustering points in a vector space or nodes in a graph is a ubiquitous primitive in statistical data analysis, and it is commonly used for exploratory data analysis. In practice, it is often of interest to "refine" or "improve" a given cluster that has been obtained by some other method. In this survey, we focus on principled algorithms for this cluster improvement problem. Many such cluster improvement algorithms are flow-based methods, by which we mean that operationally they require the solution of a sequence of maximum flow problems on a (typically implicitly) modified data graph. These cluster improvement algorithms are powerful, both in theory and in practice, but they have not been widely adopted for problems such as community detection, local graph clustering, and semi-supervised learning. Possible reasons include the steep learning curve for these algorithms, the lack of efficient and easy-to-use software, and the lack of detailed numerical experiments on real-world data that demonstrate their usefulness. Our objective here is to address these issues. To do so, we guide the reader through the whole process of understanding, implementing, and applying these algorithms. We present a unifying fractional programming optimization framework that permits us to distill, in a simple way, the crucial components of all these algorithms. The framework also makes apparent the similarities and differences between related methods, and it suggests directions for future algorithm development. Finally, we develop efficient implementations of these algorithms in our LocalGraphClustering Python package, and we perform extensive numerical experiments to demonstrate the performance of these methods on social networks and image-based data graphs.
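To make the phrase "a sequence of maximum flow problems" concrete, the following is a minimal sketch of one flow-based improvement loop in the fractional programming style. It is an illustration under stated assumptions, not the API of the LocalGraphClustering package: we assume an MQI-style objective (minimize cut(S)/vol(S) over subsets S of a seed set R with vol(R) at most half the graph volume), an unweighted undirected graph, and a plain Edmonds-Karp solver. The function names `min_cut_source_side` and `mqi_improve` and the terminal labels `"_s"` and `"_t"` are hypothetical.

```python
from collections import defaultdict, deque
from fractions import Fraction


def min_cut_source_side(cap, s, t):
    """Edmonds-Karp max flow. Returns (flow value, source side of a minimum cut)."""
    res = defaultdict(lambda: defaultdict(int))  # residual capacities
    for u in cap:
        for v, c in cap[u].items():
            res[u][v] += c
            res[v][u] += 0  # make sure the reverse arc exists
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No augmenting path: the visited nodes form the source side.
            return flow, set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push
        flow += push


def mqi_improve(adj, R):
    """Assumed MQI-style loop: repeatedly solve min over S in R of
    vol(S_k) * cut(S) - cut(S_k) * vol(S) by one max-flow computation."""
    R = set(R)
    vol_R = sum(len(adj[u]) for u in R)
    S = set(R)
    while True:
        a = sum(1 for u in S for v in adj[u] if v not in S)  # cut(S)
        b = sum(len(adj[u]) for u in S)                      # vol(S)
        cap = defaultdict(dict)
        for u in R:
            cap["_s"][u] = a * len(adj[u])       # source arc, scaled by cut(S)
            outside = sum(1 for v in adj[u] if v not in R)
            if outside:
                cap[u]["_t"] = b * outside       # edges leaving R go to the sink
            for v in adj[u]:
                if v in R:
                    cap[u][v] = b                # internal edges, scaled by vol(S)
        value, side = min_cut_source_side(cap, "_s", "_t")
        if value >= a * vol_R:                   # no subset with a smaller ratio
            return S, Fraction(a, b)
        S = side - {"_s"}


# Toy graph: a triangle {0, 1, 2} attached by one edge to a 5-clique {3, ..., 7}.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
edges += [(i, j) for i in range(3, 8) for j in range(i + 1, 8)]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

improved, phi = mqi_improve(adj, {0, 1, 2, 3})
print(sorted(improved), phi)  # [0, 1, 2] 1/7
```

Scaling the capacities by cut(S) and vol(S) keeps every capacity an integer, so the stopping test is exact. On the toy graph, the seed set {0, 1, 2, 3} has ratio 4/12, and a single flow computation shrinks it to the triangle with ratio 1/7; the second computation certifies that no subset of the seed does better.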