In this paper, we propose a novel robust principal component analysis (PCA) method for high-dimensional data in the presence of various heterogeneities, in particular heavy tails and outliers. A transformation motivated by the characteristic function is constructed to improve the robustness of classical PCA. Beyond handling typical outliers, the proposed method has the unique advantage of accommodating heavy-tailed data whose covariances may not exist (they may be infinite, for instance). The proposed approach is also a special case of kernel principal component analysis (KPCA) and inherits its robustness and non-linearity from a bounded, non-linear kernel function. The merits of the new method are illustrated by statistical properties, including an upper bound on the excess error and the behavior of the large eigenvalues under a spiked covariance model. In addition, we demonstrate the advantages of our method over classical PCA in a variety of simulations. Finally, we apply the new robust PCA to classify mice with different genotypes in a biological study based on their protein expression data, and find that our method identifies abnormal mice more accurately than classical PCA.
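To make the connection between a bounded kernel and robustness concrete, the following is a minimal sketch of kernel PCA on heavy-tailed data. It is not the transformation proposed in the paper, which the abstract does not specify: the Gaussian kernel is used here only as a stand-in for a bounded, non-linear kernel, and the function names (`bounded_kernel_matrix`, `kernel_pca_scores`), the median-based bandwidth, and the Cauchy test data are all illustrative assumptions.

```python
import numpy as np


def bounded_kernel_matrix(X, gamma=None):
    """Gaussian kernel matrix; every entry lies in [0, 1] (assumed stand-in kernel)."""
    # Pairwise squared Euclidean distances via broadcasting.
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    if gamma is None:
        # Median heuristic for the bandwidth (an assumption, not from the paper).
        upper = d2[np.triu_indices(X.shape[0], k=1)]
        gamma = 1.0 / np.median(upper)
    return np.exp(-gamma * d2)


def kernel_pca_scores(K, n_components):
    """Top eigenvalues of the doubly centered kernel matrix and the sample scores."""
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)  # centering matrix
    Kc = H @ K @ H
    vals, vecs = np.linalg.eigh(Kc)  # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    vals = np.clip(vals[order], 0.0, None)  # guard against tiny negative values
    scores = vecs[:, order] * np.sqrt(vals)
    return vals, scores


# Heavy-tailed sample: Cauchy entries have no finite mean or variance,
# so the population covariance that classical PCA targets does not exist.
rng = np.random.default_rng(0)
X = rng.standard_cauchy(size=(100, 5))

K = bounded_kernel_matrix(X)
eigvals, scores = kernel_pca_scores(K, n_components=2)
```

Because every kernel entry is bounded, a single extreme observation can change each entry of `K` by at most 1, whereas it can dominate the sample covariance matrix used by classical PCA.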