One of the main challenges for hierarchical clustering is how to appropriately identify the representative points in the lower level of the cluster tree, which are going to be utilized as the roots in the higher level of the cluster tree for further aggregation. However, conventional hierarchical clustering approaches have adopted some simple tricks to select the "representative" points which might not be as representative as enough. Thus, the constructed cluster tree is less attractive in terms of its poor robustness and weak reliability. Aiming at this issue, we propose a novel hierarchical clustering algorithm, in which, while building the clustering dendrogram, we can effectively detect the representative point based on scoring the reciprocal nearest data points in each sub-minimum-spanning-tree. Extensive experiments on UCI datasets show that the proposed algorithm is more accurate than other benchmarks. Meanwhile, under our analysis, the proposed algorithm has O(nlogn) time-complexity and O(logn) space-complexity, indicating that it has the scalability in handling massive data with less time and storage consumptions.
翻译:分类组合的主要挑战之一是,如何适当确定组群树下层的代表点,这些代表点将作为集群树上较高层的根基加以利用,以便进一步汇总;然而,传统的分类组合办法采用一些简单的技巧来选择“代表性”点,这些“代表性”点可能不够有代表性。因此,建造的组群树的强度弱和可靠性差,因此其吸引力较小。为了解决这一问题,我们建议采用新的分类组合算法,在建立群群时,我们可以根据每一次最小树上对等最接近的数据点的评分,有效地检测代表点。关于UCI数据集的广泛实验表明,拟议的算法比其他基准更准确。同时,根据我们的分析,拟议的算法具有O(nlogn)时间复杂性和O(logn)空间复杂性,表明它在处理大量数据时和存储消耗量较少的情况下具有可扩缩性。