Principal Component Analysis (PCA) is a well-known procedure for reducing the intrinsic complexity of a dataset, essentially by simplifying its covariance or correlation structure. We introduce a novel algebraic, model-based point of view and, in particular, provide an extension of PCA to distributions without second moments by formulating PCA as a best low-rank approximation problem. In contrast to existing approaches, the approximation is based on a kind of spectral representation rather than on the real space. Nonetheless, the prominent role of the eigenvectors is here reduced to defining the approximating surface and its maximal dimension. In this perspective, our approach is close to the original idea of Pearson (1901) and hence to autoencoders. Since variable selection in linear regression can be seen as a special case of our extension, our approach offers some insight into why the various variable selection methods, such as forward selection and best subset selection, cannot be expected to coincide. The linear regression model itself and PCA regression appear as limit cases.
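For orientation, the following is a minimal sketch of the classical, second-moment-based formulation that the abstract generalizes: PCA viewed as a best rank-k approximation of the centered data matrix, which by the Eckart–Young theorem is obtained by truncating its singular value decomposition. This illustrates only the standard starting point, not the paper's extension; the variable names and toy data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # toy data matrix
Xc = X - X.mean(axis=0)                                   # center the columns

k = 2                                                     # target rank
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_k = U[:, :k] * s[:k] @ Vt[:k]                           # best rank-k approximation
# rows of Vt[:k] are the leading principal axes (eigenvectors of the covariance)

# Eckart-Young: no rank-k matrix is closer in Frobenius norm than the SVD
# truncation; the residual norm equals sqrt(sum of discarded squared singular values).
assert np.isclose(np.linalg.norm(Xc - X_k), np.sqrt(np.sum(s[k:] ** 2)))
```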