Cluster analysis requires many decisions: the clustering method and its implied reference model, the number of clusters and, often, several hyper-parameters and algorithm tuning choices. In practice, one produces several partitions and chooses a final one based on validation or selection criteria. There is an abundance of validation methods that, implicitly or explicitly, assume a certain notion of cluster. Moreover, they are often restricted to partitions obtained from a specific method. In this paper, we focus on groups that can be well separated by quadratic or linear boundaries. The reference cluster concept is defined through the quadratic discriminant score function and parameters describing each cluster's size, center and scatter. We develop two cluster-quality criteria called quadratic scores. We show that these criteria are consistent with groups generated from a general class of elliptically symmetric distributions. The search for groups of this type is common in applications. We investigate the connection with likelihood theory for mixture models and model-based clustering. Based on bootstrap resampling of the quadratic scores, we propose a selection rule for choosing among many clustering solutions. The proposed method has the distinctive advantage that it can compare partitions that cannot be compared with other state-of-the-art methods. Extensive numerical experiments and the analysis of real data show that, although some competing methods are superior in certain setups, the proposed methodology achieves better overall performance.
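To make the reference cluster concept concrete, the following is a minimal sketch of the textbook quadratic discriminant score, log(size) - 0.5 * log det(scatter) - 0.5 * squared Mahalanobis distance from the center. The abstract does not spell out the two criteria, so the function names (`quadratic_scores`, `hard_score`, `params_from_labels`) and the "average of each point's best score" criterion are illustrative assumptions, not the definitions used in the paper, and the bootstrap selection rule is not reproduced here.

```python
import numpy as np


def quadratic_scores(X, props, means, covs):
    """Return an (n, K) array of quadratic discriminant scores.

    props, means, covs hold each cluster's size (proportion),
    center, and scatter matrix.
    """
    X = np.asarray(X, dtype=float)
    scores = np.empty((X.shape[0], len(props)))
    for k, (pi_k, mu_k, sigma_k) in enumerate(zip(props, means, covs)):
        diff = X - mu_k
        # Squared Mahalanobis distance of every point from center mu_k.
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma_k), diff)
        _, logdet = np.linalg.slogdet(sigma_k)
        scores[:, k] = np.log(pi_k) - 0.5 * logdet - 0.5 * maha
    return scores


def params_from_labels(X, labels):
    """Estimate size, center, and scatter of each cluster from a partition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ks = np.unique(labels)
    props = [np.mean(labels == k) for k in ks]
    means = [X[labels == k].mean(axis=0) for k in ks]
    covs = [np.cov(X[labels == k], rowvar=False) for k in ks]
    return props, means, covs


def hard_score(X, labels):
    """Illustrative criterion (an assumption): mean of each point's best score."""
    props, means, covs = params_from_labels(X, labels)
    return quadratic_scores(X, props, means, covs).max(axis=1).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                   rng.normal(10.0, 1.0, size=(50, 2))])
    good = np.repeat([0, 1], 50)   # partition matching the two blobs
    bad = np.arange(100) % 2       # partition that mixes the two blobs
    print(hard_score(X, good), hard_score(X, bad))
```

Because the score depends only on the partition's estimated sizes, centers, and scatters, a sketch like this can be applied to labels produced by any clustering method, which is the property the abstract emphasizes. It assumes every cluster has enough points for a non-singular scatter matrix.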