Comparison of Canonical Correlation and Partial Least Squares analyses of simulated and empirical data

In this paper, we compared the general forms of canonical correlation analysis (CCA) and partial least squares (PLS) on three simulated and two empirical datasets, all having large sample sizes. We took successively smaller subsamples of these data to evaluate sensitivity, reliability, and reproducibility. In null data having no correlation within or between blocks, both methods showed equivalent false positive rates across sample sizes. Both methods also showed equivalent detection in data with weak but reliable effects until sample sizes dropped below n=50. In the case of strong effects, both methods showed similar performance unless the correlations of items within one data block were high. For PLS, the results were reproducible across sample sizes for strong effects, except at the smallest sample sizes. In contrast, the reproducibility of CCA declined when the within-block correlations were high. This was ameliorated if a principal components analysis (PCA) was performed and the component scores were used to calculate the cross-block matrix. Our examination yields three messages. First, for data with reasonable within- and between-block structure, CCA and PLS give comparable results. Second, high correlations within either block can compromise the reliability of CCA results. This known issue of CCA can be remedied by running PCA before the cross-block calculation, although this assumes that the PCA structure is stable for a given sample. Third, null hypothesis testing does not guarantee that the results are reproducible, even with large sample sizes. This final outcome suggests that both statistical significance and reproducibility should be assessed for any data.
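
The comparison described above can be illustrated with a minimal sketch. It is not the authors' code: it assumes PLS is computed as the singular value decomposition of the cross-block correlation matrix, that CCA is computed from orthonormal bases of the centered blocks (a QR-based formulation that requires each block to have full column rank), and that the PCA remedy means replacing a block by its leading component scores. All function names, the simulated data, and the choice of two components are hypothetical.

```python
import numpy as np


def zscore(a):
    """Center each column and scale it to unit sample variance."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)


def pls_singular_values(x, y):
    """PLS sketch: singular values of the cross-block correlation matrix."""
    n = x.shape[0]
    r_xy = zscore(x).T @ zscore(y) / (n - 1)
    return np.linalg.svd(r_xy, compute_uv=False)


def cca_correlations(x, y):
    """CCA sketch: canonical correlations via QR of the centered blocks.

    The singular values of Qx^T Qy equal the canonical correlations when
    both blocks have full column rank.
    """
    qx, _ = np.linalg.qr(x - x.mean(axis=0))
    qy, _ = np.linalg.qr(y - y.mean(axis=0))
    return np.linalg.svd(qx.T @ qy, compute_uv=False)


def pca_scores(x, k):
    """Scores on the first k principal components of a block."""
    xc = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(xc, full_matrices=False)
    return u[:, :k] * s[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 1000
    latent = rng.normal(size=(n, 1))
    # Hypothetical data: block X has 6 highly intercorrelated items,
    # block Y has 4 noisier items that share the same latent variable.
    x = latent + 0.3 * rng.normal(size=(n, 6))
    y = 0.5 * latent + rng.normal(size=(n, 4))

    # Successively smaller subsamples, as in the design described above.
    for size in (1000, 200, 50):
        idx = rng.choice(n, size=size, replace=False)
        xs, ys = x[idx], y[idx]
        print(
            size,
            pls_singular_values(xs, ys)[0],          # first PLS singular value
            cca_correlations(xs, ys)[0],             # first canonical correlation
            cca_correlations(pca_scores(xs, 2), ys)[0],  # CCA after PCA on X
        )
```

Comparing the three printed columns across subsample sizes gives a rough feel for how each estimate drifts as n shrinks. A reproducibility assessment of the kind the abstract calls for would additionally compare the singular vectors (weights) across subsamples, which this sketch omits.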
