We address the problem of defining a group sparse formulation of Principal Components Analysis (PCA), or its equivalent formulations as low-rank approximation or dictionary learning problems, which achieves a compromise between maximizing the variance explained by the components and promoting sparsity of the loadings. We first propose a new definition of the variance explained by components that are not necessarily orthogonal, which is optimal in a certain sense and consistent with the standard principal components setting. We then apply a specific regularization of this variance by the group-$\ell_{1}$ norm to define a Group Sparse Maximum Variance (GSMV) formulation of PCA. The GSMV formulation achieves our objective by construction, and has the useful property that the inner nonsmooth optimization problem can be solved analytically, thus reducing GSMV to the maximization of a smooth convex function under unit-norm and orthogonality constraints, which generalizes Journée et al. (2010) to group sparsity. Numerical comparison with deflation on synthetic data shows that GSMV consistently produces slightly better and more robust results for the retrieval of hidden sparse structures, and is about three times faster on these examples. Application to real data shows the interest of group sparsity for variable selection in PCA of mixed (categorical/numerical) data.
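To make the idea concrete, here is a minimal illustrative sketch (not the authors' algorithm) of a group-sparse leading component computed by alternating maximization with a group soft-thresholding step, in the spirit of the power-iteration schemes of Journée et al. (2010) extended to a group-$\ell_{1}$ penalty. All function names, the penalty level `lam`, and the synthetic data are hypothetical choices for this sketch.

```python
import numpy as np

def group_soft_threshold(v, groups, lam):
    """Shrink each group of coordinates of v toward zero by lam in l2 norm.

    Groups whose l2 norm is at most lam are set exactly to zero,
    which is what produces group-wise sparsity in the loadings.
    """
    w = np.zeros_like(v)
    for g in groups:
        norm = np.linalg.norm(v[g])
        if norm > lam:
            w[g] = (1.0 - lam / norm) * v[g]
    return w

def group_sparse_pc(X, groups, lam=0.1, n_iter=200, seed=0):
    """Illustrative group-sparse leading loading vector for data matrix X.

    Alternates x <- Xz / ||Xz|| (smooth step) with a closed-form
    group soft-thresholding of X^T x (nonsmooth step), then renormalizes.
    """
    rng = np.random.default_rng(seed)
    _, p = X.shape
    z = rng.standard_normal(p)
    z /= np.linalg.norm(z)
    for _ in range(n_iter):
        x = X @ z
        x /= np.linalg.norm(x)
        z = group_soft_threshold(X.T @ x, groups, lam)
        nz = np.linalg.norm(z)
        if nz == 0.0:  # penalty too strong: all groups killed
            break
        z /= nz
    return z

# Demo on synthetic data: two groups of three variables,
# with the hidden sparse structure living only in the first group.
rng = np.random.default_rng(1)
u = rng.standard_normal(100)
v = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(3)
X = np.outer(u, v) + 0.01 * rng.standard_normal((100, 6))
groups = [[0, 1, 2], [3, 4, 5]]
z = group_sparse_pc(X, groups, lam=0.5)
```

With a moderate penalty, the second (noise-only) group of loadings is set exactly to zero while the signal group is recovered, which is the group-wise variable selection behavior the abstract refers to.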