Sparse PCA: A New Scalable Estimator Based On Integer Programming

We consider the Sparse Principal Component Analysis (SPCA) problem under the well-known spiked covariance model. Recent work has shown that the SPCA problem can be reformulated as a Mixed Integer Program (MIP) and solved to global optimality, leading to estimators that are known to enjoy optimal statistical properties. However, current MIP algorithms for SPCA are unable to scale beyond instances with roughly a thousand features. In this paper, we propose a new estimator for SPCA that can be formulated as a MIP. Unlike earlier work, we make use of the underlying spiked covariance model and properties of the multivariate Gaussian distribution to arrive at our estimator. We establish statistical guarantees for our proposed estimator in terms of estimation error and support recovery. We propose a custom algorithm to solve the MIP that is significantly more scalable than off-the-shelf solvers, and we demonstrate that our approach can be much more computationally attractive than earlier exact MIP-based approaches for the SPCA problem. Our numerical experiments on synthetic and real datasets show that our algorithms can address problems with up to 20,000 features in minutes and generally result in favorable statistical properties compared to existing popular approaches for SPCA.
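
For readers unfamiliar with the setting, the spiked covariance model and the classical cardinality-constrained SPCA problem can be sketched as follows. The notation ($n$, $p$, $\theta$, $u^*$, $s$, $S$) is assumed here for illustration and does not appear in the abstract. The second display is the standard formulation that earlier MIP-based work typically targets; it is not the new estimator proposed in this paper.

```latex
% Spiked covariance model: n i.i.d. Gaussian samples in R^p whose
% covariance is the identity plus a rank-one "spike" along a sparse,
% unit-norm direction u^* (all symbols are illustrative assumptions).
x_1, \dots, x_n \overset{\text{i.i.d.}}{\sim}
  \mathcal{N}\!\left(0,\; I_p + \theta\, u^* (u^*)^\top\right),
\qquad \|u^*\|_2 = 1, \quad \|u^*\|_0 \le s, \quad \theta > 0.

% Classical cardinality-constrained SPCA on the sample covariance S:
% maximize explained variance over unit vectors with at most s nonzeros.
\hat{u} \in \arg\max_{u \in \mathbb{R}^p} \; u^\top S u
\quad \text{s.t.} \quad \|u\|_2 = 1, \;\; \|u\|_0 \le s,
\qquad S = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top .
```

The constraint $\|u\|_0 \le s$ is what makes the problem combinatorial: it can be modeled with one binary indicator per feature, which is why a MIP formulation arises naturally.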
