Minimum spanning trees (MSTs) provide a convenient representation of datasets in numerous pattern recognition activities. Moreover, they are relatively fast to compute. In this paper, we quantify the extent to which they can be meaningful in data clustering tasks. By identifying the upper bounds for the agreement between the best (oracle) algorithm and the expert labels on a large battery of benchmark data, we discover that MST methods can overall be very competitive. Next, instead of proposing yet another algorithm that performs well on a limited set of examples, we review, study, extend, and generalise the existing state-of-the-art MST-based partitioning schemes, which leads to a few new and interesting approaches. It turns out that the Genie method and the information-theoretic approaches often outperform non-MST algorithms such as k-means, Gaussian mixtures, spectral clustering, BIRCH, and classical hierarchical agglomerative procedures.
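To illustrate what an MST-based partitioning scheme looks like in its simplest form (this is only a sketch of the general idea, not any of the specific methods studied in the paper), one can build the Euclidean MST of a dataset and remove its k-1 heaviest edges; the connected components that remain form k clusters, which is equivalent to single-linkage clustering. A minimal Python sketch using SciPy:

```python
# Minimal sketch of the simplest MST-based partition (single-linkage-style):
# build the Euclidean MST and cut its k-1 heaviest edges; the remaining
# connected components are the clusters. Illustrative only.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_cut_clustering(X, k):
    """Partition the rows of X into k clusters by cutting the MST."""
    D = squareform(pdist(X))                  # pairwise Euclidean distances
    mst = minimum_spanning_tree(D).toarray()  # n x n array with n-1 nonzero edges
    # zero out the k-1 heaviest tree edges
    heaviest = np.argsort(mst, axis=None)[::-1][:k - 1]
    mst.flat[heaviest] = 0.0
    # clusters = connected components of the pruned tree
    _, labels = connected_components(mst, directed=False)
    return labels

# toy usage: two well-separated Gaussian blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
print(mst_cut_clustering(X, 2))
```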