Optimally Weighted PCA for High-Dimensional Heteroscedastic Data

Modern data are increasingly both high-dimensional and heteroscedastic. This paper considers the challenge of estimating underlying principal components from high-dimensional data with noise that is heteroscedastic across samples, i.e., some samples are noisier than others. Such heteroscedasticity naturally arises, e.g., when combining data from diverse sources or sensors. A natural way to account for this heteroscedasticity is to give noisier blocks of samples less weight in PCA by using the leading eigenvectors of a weighted sample covariance matrix. We consider the problem of choosing weights to optimally recover the underlying components. In general, one cannot know these optimal weights since they depend on the underlying components we seek to estimate. However, we show that under some natural statistical assumptions the optimal weights converge to a simple function of the signal and noise variances for high-dimensional data. Surprisingly, the optimal weights are not the inverse noise variance weights commonly used in practice. We demonstrate the theoretical results through numerical simulations and comparisons with existing weighting schemes. Finally, we briefly discuss how estimated signal and noise variances can be used when the true variances are unknown, and we illustrate the optimal weights on real data from astronomy.
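As a rough illustration of the weighted PCA estimator described above, the following sketch computes the leading eigenvectors of a weighted sample covariance matrix and compares two weighting schemes on synthetic data. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `weighted_pca` and `simulate`, the rank-one Gaussian data model, and all parameter values are hypothetical choices made for this example. The optimal weights derived in the paper are not reproduced here; only uniform weights and the inverse noise variance weights mentioned above are shown.

```python
import numpy as np


def weighted_pca(Y, weights, k):
    """Return the k leading eigenvectors of a weighted sample covariance.

    Y       : (d, n) array, one sample per column.
    weights : length-n nonnegative weights, one per sample.
    k       : number of components to return.
    """
    Y = np.asarray(Y, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = Y.shape[1]
    # Weighted sample covariance: (1/n) * sum_j w_j * y_j y_j^T
    C = (Y * w) @ Y.T / n
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order]


def simulate(d, block_sizes, noise_vars, theta, rng):
    """Draw rank-one data with block-wise noise variances (illustrative model)."""
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)  # unit-norm underlying component
    n = sum(block_sizes)
    z = rng.standard_normal(n)  # per-sample coefficients
    sigma2 = np.repeat(noise_vars, block_sizes)  # noise variance of each sample
    noise = np.sqrt(sigma2) * rng.standard_normal((d, n))
    Y = theta * np.outer(u, z) + noise
    return Y, u, sigma2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Assumed setup: a small clean block and a larger noisy block.
    Y, u, sigma2 = simulate(d=200, block_sizes=[400, 1600],
                            noise_vars=[0.5, 8.0], theta=1.0, rng=rng)
    schemes = {
        "uniform": np.ones_like(sigma2),
        "inverse noise variance": 1.0 / sigma2,
    }
    for name, w in schemes.items():
        u_hat = weighted_pca(Y, w, k=1)[:, 0]
        # Squared inner product with the true component, between 0 and 1.
        print(f"{name}: recovery = {(u @ u_hat) ** 2:.3f}")
```

With uniform weights the function reduces to ordinary PCA, so any other weighting scheme can be compared against that baseline by swapping the `weights` argument.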
