The VARCLUST algorithm is proposed for clustering variables under the assumption that the variables in a given cluster are linear combinations of a small number of hidden latent variables, corrupted by random noise. The entire clustering task is viewed as a statistical model selection problem, where the model is defined by the number of clusters, the partition of variables into these clusters and the 'cluster dimensions', i.e. the vector of dimensions of the linear subspaces spanning each of the clusters. The optimal model is selected using an approximate Bayesian criterion based on Laplace approximations and a non-informative uniform prior on the number of clusters. To search over the huge space of possible models, we propose an extension of the ClustOfVar algorithm, which was restricted to subspaces of dimension 1 and which is similar in structure to the $K$-centroid algorithm. We provide a complete methodology with theoretical guarantees, extensive numerical experiments, complete data analyses and an implementation. Our algorithm assigns variables to appropriate clusters based on the consistent Bayesian Information Criterion (BIC), and estimates the dimensionality of each cluster by the PEnalized SEmi-integrated Likelihood Criterion (PESEL), whose consistency we prove. Additionally, we prove that each iteration of our algorithm leads to an increase of the Laplace approximation to the model posterior probability, and we provide a criterion for estimating the number of clusters. Numerical comparisons with other algorithms show that VARCLUST may outperform some popular machine learning tools for sparse subspace clustering. We also report the results of real data analyses, including TCGA breast cancer data and meteorological data. The proposed method is implemented in the publicly available R package varclust.
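The $K$-centroid structure described above can be illustrated with a minimal Python sketch. It is not the varclust implementation and does not reproduce its API: the function names (`subspace_basis`, `kcentroid_subspace_clustering`) are hypothetical, the number of clusters and a common subspace dimension `dim` are assumed to be given rather than selected by the Bayesian criterion or PESEL, and the assignment step compares residual sums of squares, on the assumption that a BIC-type penalty would be identical across clusters when they all share the same dimension.

```python
import numpy as np


def subspace_basis(X, dim):
    """Orthonormal basis (n x dim) of the top principal subspace of the n x p block X."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :dim]


def kcentroid_subspace_clustering(X, n_clusters, dim, n_iter=20, seed=0):
    """Alternate between fitting one subspace per cluster and reassigning variables.

    X is n x p (observations in rows, variables in columns).
    Returns an integer label in [0, n_clusters) for each of the p variables.
    Assumption: `n_clusters` and `dim` are fixed by the caller (no model selection here).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    X = X - X.mean(axis=0)                      # center each variable
    labels = rng.integers(n_clusters, size=p)   # random initial partition

    for _ in range(n_iter):
        # Centroid step: the "centroid" of a cluster is its principal subspace.
        bases = []
        for k in range(n_clusters):
            block = X[:, labels == k]
            if block.shape[1] == 0:
                # Empty cluster: reseed it with one randomly chosen variable.
                block = X[:, rng.integers(p, size=1)]
            bases.append(subspace_basis(block, dim))

        # Assignment step: residual sum of squares of every variable after
        # projection onto every cluster subspace; pick the best-fitting cluster.
        rss = np.stack([((X - B @ (B.T @ X)) ** 2).sum(axis=0) for B in bases])
        new_labels = rss.argmin(axis=0)

        if np.array_equal(new_labels, labels):
            break                               # partition is stable
        labels = new_labels

    return labels
```

Like any $K$-centroid procedure, this sketch can stop in a local optimum that depends on the random initial partition, so in practice one would typically run it from several seeds and keep the best partition under the chosen criterion.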