The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algorithm is very sensitive to outliers, produces highly skewed dendrograms, and therefore usually does not reflect the true underlying data structure -- unless the clusters are well-separated. To overcome its limitations, we propose a new hierarchical clustering linkage criterion called Genie. Namely, our algorithm links two clusters in such a way that a chosen economic inequity measure (e.g., the Gini- or Bonferroni-index) of the cluster sizes does not drastically increase above a given threshold. The presented benchmarks indicate a high practical usefulness of the introduced method: it most often outperforms the Ward or average linkage in terms of the clustering quality while retaining the single linkage's speed. The Genie algorithm is easily parallelizable and thus may be run on multiple threads to speed up its execution even further. Its memory overhead is small: there is no need to precompute the complete distance matrix to perform the computations in order to obtain a desired clustering. It can be applied on arbitrary spaces equipped with a dissimilarity measure, e.g., on real vectors, DNA or protein sequences, images, rankings, informetric data, etc. A reference implementation of the algorithm has been included in the open source 'genie' package for R. See also https://genieclust.gagolewski.com for a new implementation (genieclust) -- available for both R and Python.
翻译:应用等级类组算法所需要的时间通常以对齐不同度测量的计算数量为主。 对于较大的数据集来说,这种限制不利于使用所有古典链接标准,但单一链接标准。 然而,已知单一链接组合算法对外端非常敏感,产生高度扭曲的斜体格,因此通常不反映真正的原始数据结构 -- 除非组合组是分开的。为了克服其局限性,我们提议了一个新的等级类组连接标准,称为Genie。也就是说,我们的算法链接了两个组群,这样,就使得所选择的群组规模的经济不平等衡量标准(例如, Gini- 或 Bonferroni- index) 无法大大超过给外端值。 但是, 所提供的基准表明采用的方法非常实用性强: 它通常在保持单一链接的速度的同时, 通常比标准值或平均的组合质量差。 Genie 算法可以很容易地平行, 因此, 可以在多个线索上运行, 甚至加速其执行。 其内存的 Rdrodeal dislation 将数据应用到直径 。