Principal component analysis (PCA) is one of the most frequently used statistical tools in almost all branches of data science. However, like many other statistical tools, it carries a risk of misuse or even abuse. In this paper, we highlight possible pitfalls in applying theoretical results for PCA that assume independent data when the data are in fact time series. For time series data, we state and prove a central limit theorem for the eigenvalues and eigenvectors (loadings), give direct and bootstrap estimators of their asymptotic covariances, and assess their efficacy via simulation. Specifically, we pay attention to the proportion of variation, which determines the number of principal components (PCs), and the loadings, which help interpret the meaning of the PCs. We find that while the proportion of variation is quite robust to different dependence assumptions, inference on the PC loadings requires careful attention. We begin and conclude our investigation with an empirical example on portfolio management, in which the PC loadings play a prominent role. It serves as a paradigm of correct usage of PCA for time series data.
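To make the idea of bootstrap standard errors for PCA under serial dependence concrete, the following is a minimal sketch and not the estimator developed in the paper. It assumes a moving block bootstrap, which is one common way to resample time series; the abstract does not say which bootstrap scheme is used. The function names (`pca_stats`, `block_bootstrap_se`, `simulate_var1`), the block length, and the VAR(1) simulation are all hypothetical choices made for illustration.

```python
import numpy as np


def pca_stats(x, ref=None):
    """Proportion of variation and loadings of the first PC for x of shape (n, p)."""
    cov = np.cov(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    v = eigvecs[:, -1]                      # loadings of the leading PC
    # Eigenvectors are defined up to sign; align with a reference if given.
    if ref is not None and v @ ref < 0:
        v = -v
    return eigvals[-1] / eigvals.sum(), v


def block_bootstrap_se(x, block_len=20, n_boot=200, seed=0):
    """Moving block bootstrap standard errors (an assumed scheme, see lead-in)."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    _, ref = pca_stats(x)
    n_blocks = int(np.ceil(n / block_len))
    offsets = np.arange(block_len)
    props = np.empty(n_boot)
    loads = np.empty((n_boot, p))
    for b in range(n_boot):
        # Draw block start points, then glue whole blocks to keep local dependence.
        starts = rng.integers(0, n - block_len + 1, size=n_blocks)
        idx = (starts[:, None] + offsets[None, :]).ravel()[:n]
        props[b], loads[b] = pca_stats(x[idx], ref=ref)
    return props.std(ddof=1), loads.std(axis=0, ddof=1)


def simulate_var1(n=500, p=4, phi=0.7, seed=1):
    """Hypothetical test data: x_t = phi * x_{t-1} + e_t with correlated noise."""
    rng = np.random.default_rng(seed)
    mix = rng.normal(size=(p, p))
    x = np.zeros((n, p))
    for t in range(1, n):
        x[t] = phi * x[t - 1] + mix @ rng.normal(size=p)
    return x


if __name__ == "__main__":
    data = simulate_var1()
    prop, loadings = pca_stats(data)
    se_prop, se_load = block_bootstrap_se(data)
    print("proportion of variation (PC1):", prop, "bootstrap SE:", se_prop)
    print("PC1 loadings:", loadings)
    print("bootstrap SE of loadings:", se_load)
```

Setting `block_len=1` reduces the scheme to an ordinary i.i.d. bootstrap, so comparing the two settings on the same series gives a rough sense of how much the independence assumption changes the reported uncertainty of the loadings.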