When the dimension of data is comparable to or larger than the number of data samples, Principal Components Analysis (PCA) may exhibit problematic high-dimensional noise. In this work, we propose an Empirical Bayes PCA (EB-PCA) method that reduces this noise by estimating a joint prior distribution for the principal components (PCs). EB-PCA is based on the classical Kiefer-Wolfowitz nonparametric MLE for empirical Bayes estimation, distributional results derived from random matrix theory for the sample PCs, and iterative refinement using an Approximate Message Passing (AMP) algorithm. In theoretical "spiked" models, EB-PCA achieves Bayes-optimal estimation accuracy in the same settings as an oracle Bayes AMP procedure that knows the true priors. Empirically, EB-PCA significantly improves over PCA when there is strong prior structure, both in simulation and on quantitative benchmarks constructed from the 1000 Genomes Project and the International HapMap Project. An illustration is presented for the analysis of gene expression data obtained by single-cell RNA-seq.
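
To make the empirical Bayes step concrete, the following is a minimal sketch of nonparametric-MLE denoising for a single PC. It is not the authors' implementation. It assumes a simplified scalar model in which each entry of a sample PC is observed as `y = mu * theta + sigma * z` with standard Gaussian noise `z`, and that the alignment `mu` and noise level `sigma` are already known (in EB-PCA these would come from the random matrix theory results). The prior is fitted by plain EM over a fixed grid of support points, which is one simple way to approximate the Kiefer-Wolfowitz estimator. The function names, the grid size, and the two-point prior used in the demo are all illustrative assumptions, and the AMP refinement is omitted.

```python
import numpy as np

def npmle_weights(y, grid, mu, sigma, n_iter=200):
    """Approximate NPMLE of the prior on a fixed grid via EM (illustrative)."""
    # Log-likelihood of each observation under each candidate support point.
    resid = (y[:, None] - mu * grid[None, :]) / sigma
    logl = -0.5 * resid**2
    # Row-wise shift for numerical stability; it cancels in the posterior.
    logl -= logl.max(axis=1, keepdims=True)
    lik = np.exp(logl)
    w = np.full(grid.size, 1.0 / grid.size)  # start from uniform weights
    for _ in range(n_iter):
        post = lik * w
        post /= post.sum(axis=1, keepdims=True)  # E-step: posterior over grid
        w = post.mean(axis=0)                    # M-step: update prior weights
    return w, lik

def posterior_mean(lik, w, grid):
    """Posterior mean of theta given each observation, under the fitted prior."""
    post = lik * w
    post /= post.sum(axis=1, keepdims=True)
    return post @ grid

def demo(seed=0, n=2000, mu=0.8, sigma=0.6):
    """Compare naive rescaling y / mu with the empirical Bayes posterior mean."""
    rng = np.random.default_rng(seed)
    theta = rng.choice([-1.0, 1.0], size=n)  # assumed two-point prior
    y = mu * theta + sigma * rng.standard_normal(n)
    grid = np.linspace(y.min() / mu, y.max() / mu, 101)
    w, lik = npmle_weights(y, grid, mu, sigma)
    theta_eb = posterior_mean(lik, w, grid)
    mse_naive = np.mean((y / mu - theta) ** 2)
    mse_eb = np.mean((theta_eb - theta) ** 2)
    return mse_naive, mse_eb

if __name__ == "__main__":
    mse_naive, mse_eb = demo()
    print(f"naive MSE: {mse_naive:.3f}, empirical Bayes MSE: {mse_eb:.3f}")
```

The prior in this demo is strongly structured (two point masses), so the posterior mean shrinks observations toward the two clusters and should give a lower mean squared error than simply rescaling the noisy values. With a Gaussian prior the gain over linear shrinkage would be much smaller.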