Principal component analysis (PCA) is a classical and ubiquitous method for reducing data dimensionality, but it is suboptimal for the heterogeneous data that are increasingly common in modern applications. PCA treats all samples uniformly and so degrades when the noise is heteroscedastic across samples, as occurs, e.g., when samples come from sources of heterogeneous quality. This paper develops a probabilistic PCA variant that estimates and accounts for this heterogeneity by incorporating it in the statistical model. Unlike in the homoscedastic setting, the resulting nonconvex optimization problem does not appear to be solved by a singular value decomposition. The proposed heteroscedastic probabilistic PCA technique (HePPCAT) uses efficient alternating maximization algorithms to jointly estimate both the underlying factors and the unknown noise variances. Simulation experiments illustrate the comparative speed of the algorithms, the benefit of accounting for heteroscedasticity, and the seemingly favorable optimization landscape of this problem. Real data experiments on environmental air quality data show that HePPCAT can give a better PCA estimate than techniques that do not account for heteroscedasticity.
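To make the effect of heteroscedastic noise concrete, the following is a minimal NumPy sketch, not the HePPCAT algorithm itself. It assumes a factor model in which every sample shares the same factors but each group of samples has its own noise variance, and it compares plain PCA against a simple inverse-variance-weighted PCA that is given the true variances (HePPCAT, by contrast, estimates them). The function names, dimensions, group sizes, factor strengths, and variances are all hypothetical choices made for illustration.

```python
import numpy as np


def subspace_error(U_hat, U_true):
    """Normalized squared Frobenius distance between projectors, in [0, 1]."""
    k = U_true.shape[1]
    P_hat = U_hat @ U_hat.T
    P_true = U_true @ U_true.T
    return np.linalg.norm(P_hat - P_true, "fro") ** 2 / (2 * k)


def top_k_left_singular_vectors(Y, k):
    """PCA basis estimate: leading k left singular vectors of the d x n data."""
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    return U[:, :k]


def simulate(seed=0, d=50, strengths=(4.0, 2.0, 1.0),
             counts=(200, 800), variances=(0.1, 10.0)):
    """Draw samples y = F z + e with group-dependent noise variance (assumed model)."""
    rng = np.random.default_rng(seed)
    k = len(strengths)
    # Random orthonormal basis for the true subspace, scaled into factors F.
    U_true, _ = np.linalg.qr(rng.standard_normal((d, k)))
    F = U_true * np.sqrt(np.array(strengths))
    blocks, weights = [], []
    for n, v in zip(counts, variances):
        Z = rng.standard_normal((k, n))                  # latent coefficients
        E = np.sqrt(v) * rng.standard_normal((d, n))     # noise with variance v
        blocks.append(F @ Z + E)
        weights.append(np.full(n, 1.0 / np.sqrt(v)))     # inverse-std weights
    return np.hstack(blocks), np.concatenate(weights), U_true


if __name__ == "__main__":
    Y, w, U_true = simulate()
    k = U_true.shape[1]
    err_plain = subspace_error(top_k_left_singular_vectors(Y, k), U_true)
    # Scaling each column by 1/sqrt(variance) down-weights the noisier samples.
    err_weighted = subspace_error(top_k_left_singular_vectors(Y * w, k), U_true)
    print(f"plain PCA subspace error:    {err_plain:.3f}")
    print(f"weighted PCA subspace error: {err_weighted:.3f}")
```

In this setup the noisy group makes up most of the samples, so treating all samples uniformly lets its noise dominate the estimate. The weighted variant is only a baseline that relies on known variances; the method described above instead estimates the factors and the noise variances jointly.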