Clustering is an essential task in unsupervised learning. It aims to automatically separate instances into coherent subsets. As one of the best-known clustering algorithms, k-means assigns every sample point, including those at cluster boundaries, to a unique cluster, but it does not use information about the sample distribution or density. By comparison, it could be more beneficial to consider the probability that each sample belongs to a candidate cluster. To this end, this paper generalizes k-means to model the distribution of clusters. Our clustering algorithm models the distribution of distances to centroids that exceed a threshold using the Generalized Pareto Distribution (GPD) from Extreme Value Theory (EVT). Specifically, we propose the concept of centroid margin distance, use the GPD to establish a probability model for each cluster, and perform clustering based on the covering probability function derived from the GPD. The resulting GPD k-means thus approaches clustering from a probabilistic perspective. We also introduce a naive baseline, dubbed Generalized Extreme Value (GEV) k-means. The GEV distribution fits the distribution of block maxima. In contrast, the GPD fits the distribution of distances to the centroid that exceed a sufficiently large threshold, which leads to more stable performance for GPD k-means. Notably, GEV k-means can also estimate cluster structure and performs reasonably well relative to classical k-means. Extensive experiments on synthetic and real datasets demonstrate that GPD k-means outperforms competitors. The code is released at https://github.com/sixiaozheng/EVT-K-means.
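As a rough illustration of the peaks-over-threshold idea described above, the following sketch fits a GPD to the distances from a single centroid that exceed a threshold, then evaluates the fitted tail probability. It is not the released implementation: the helper names `fit_gpd_tail` and `tail_survival`, the use of an empirical 0.9 quantile as the threshold, and the synthetic Gaussian data are all assumptions made for illustration, and the survival function shown here is only a stand-in, not the paper's covering probability function.

```python
import numpy as np
from scipy.stats import genpareto


def fit_gpd_tail(distances, quantile=0.9):
    """Fit a GPD to the excesses of `distances` over an empirical quantile.

    The quantile-based threshold is an assumption of this sketch.
    """
    distances = np.asarray(distances, dtype=float)
    threshold = np.quantile(distances, quantile)
    excesses = distances[distances > threshold] - threshold
    # Excesses start at zero by construction, so the location is fixed at 0.
    shape, _, scale = genpareto.fit(excesses, floc=0.0)
    return threshold, shape, scale


def tail_survival(d, threshold, shape, scale):
    """Fitted probability that an excess is larger than d - threshold.

    Distances at or below the threshold map to an excess of 0, i.e. 1.0.
    """
    excess = np.maximum(np.asarray(d, dtype=float) - threshold, 0.0)
    return genpareto.sf(excess, shape, loc=0.0, scale=scale)


# Synthetic single cluster (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
points = rng.normal(size=(2000, 2))
centroid = points.mean(axis=0)
distances = np.linalg.norm(points - centroid, axis=1)

threshold, shape, scale = fit_gpd_tail(distances)
near = float(tail_survival(threshold, threshold, shape, scale))
far = float(tail_survival(threshold + 1.0, threshold, shape, scale))
print(f"threshold={threshold:.3f} shape={shape:.3f} scale={scale:.3f}")
print(f"tail probability at threshold={near:.3f}, one unit beyond={far:.3f}")
```

In a full clustering loop, one such tail model would presumably be fitted per centroid and the resulting probabilities compared across clusters when assigning points; how the paper does this exactly is not specified in the abstract.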