For high-dimensional data or data with noise variables, tandem clustering is a well-known technique that aims to improve cluster identification by first reducing the dimension. However, the usual approach using principal component analysis (PCA) has been criticized for focusing only on inertia so that the first components do not necessarily retain the structure of interest for clustering. To overcome this drawback, we propose a new tandem clustering approach based on invariant coordinate selection (ICS). By jointly diagonalizing two scatter matrices, ICS is designed to find structure in the data while returning affine invariant components. Some theoretical results have already been derived and guarantee that under some elliptical mixture models, the structure of the data can be highlighted on a subset of the first and/or last components. Nevertheless, ICS has received little attention in a clustering context. Two challenges are the choice of the pair of scatter matrices and the selection of the components to retain. For clustering purposes, we demonstrate that the best scatter pairs consist of one scatter matrix that captures the within-cluster structure and another that captures the global structure. For the former, local shape or pairwise scatters are of great interest, as is the minimum covariance determinant (MCD) estimator based on a carefully selected subset size that is smaller than usual. We evaluate the performance of ICS as a dimension reduction method in terms of preserving the cluster structure present in data. In an extensive simulation study and in empirical applications with benchmark data sets, we compare different combinations of scatter matrices, component selection criteria, and the impact of outliers. Overall, the new approach of tandem clustering with ICS shows promising results and clearly outperforms the approach with PCA.
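To make the idea of jointly diagonalizing two scatter matrices concrete, the following is a minimal sketch of tandem clustering with ICS on simulated data. It rests on several assumptions that are not taken from the abstract: the scatter pair is the classical COV-COV4 combination rather than the local shape, pairwise, or MCD scatters recommended above; the component is chosen by a simple heuristic (the eigenvalue farthest from the median eigenvalue) rather than one of the selection criteria studied in the paper; and the function names `cov4` and `ics`, the simulated two-cluster data, and the deterministic k-means initialization are all illustrative.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2


def cov4(X):
    """Scatter matrix of fourth moments (COV4), an assumed choice of second scatter."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S1 = np.cov(X, rowvar=False)
    # Squared Mahalanobis distances with respect to the covariance matrix.
    d2 = np.einsum("ij,jk,ik->i", Xc, np.linalg.inv(S1), Xc)
    return (Xc * d2[:, None]).T @ Xc / (n * (p + 2))


def ics(X):
    """Jointly diagonalize COV and COV4; return eigenvalues, B, and the components."""
    S1 = np.cov(X, rowvar=False)
    S2 = cov4(X)
    # Generalized eigenproblem S2 b = lambda S1 b, normalized so that B.T @ S1 @ B = I.
    eigvals, B = eigh(S2, S1)
    order = np.argsort(eigvals)[::-1]  # sort components by decreasing eigenvalue
    eigvals, B = eigvals[order], B[:, order]
    Z = (X - X.mean(axis=0)) @ B
    return eigvals, B, Z


# Illustrative data: two balanced Gaussian clusters separated along the first
# variable, plus three pure noise variables.
rng = np.random.default_rng(0)
n, p = 2000, 4
labels = np.repeat([0, 1], n // 2)
X = rng.standard_normal((n, p))
X[:, 0] += np.where(labels == 1, 3.0, -3.0)

# Step 1: dimension reduction with ICS.
eigvals, B, Z = ics(X)

# Heuristic selection (an assumption): keep the component whose eigenvalue
# deviates most from the median eigenvalue.
selected = int(np.argmax(np.abs(eigvals - np.median(eigvals))))
z = Z[:, [selected]]

# Step 2: k-means on the retained component, started deterministically from
# the two extreme values so the sketch is reproducible.
init = np.array([[z.min()], [z.max()]])
_, assigned = kmeans2(z, init, minit="matrix")

# Cluster labels are only defined up to a swap, so score both assignments.
agreement = np.mean(assigned == labels)
accuracy = max(agreement, 1.0 - agreement)

print("eigenvalues:", np.round(eigvals, 3))
print("selected component index:", selected)
print("clustering accuracy:", round(accuracy, 3))
```

With COV-COV4 and a balanced two-cluster mixture, the cluster direction has low kurtosis, so the structure is expected on the last component; this matches the statement above that the structure can appear on the first and/or last components. Swapping in a different scatter pair only requires replacing the two matrices passed to `eigh`.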