Classical canonical correlation analysis (CCA) requires the data matrices to be low-dimensional, i.e., the number of features cannot exceed the sample size. Recent developments in CCA have mainly focused on the high-dimensional setting, where the number of features in both matrices under analysis greatly exceeds the sample size. These approaches impose penalties in optimization problems that must be solved iteratively, and they estimate multiple canonical vectors sequentially. In this work, we provide an explicit link between sparse multiple regression and sparse canonical correlation analysis, together with an efficient algorithm that estimates multiple canonical pairs simultaneously rather than sequentially. Furthermore, the algorithm naturally allows parallel computing. These properties make the algorithm much more efficient. We provide theoretical results on the consistency of the canonical pairs. The algorithm and theoretical development are based on solving an eigenvector problem, which significantly differentiates our method from existing methods. Simulation results support the improved performance of the proposed approach. We apply eigenvector-based CCA to the analysis of GTEx thyroid histology images, the analysis of SNPs and RNA-seq gene expression data, and a microbiome study. The real data analyses also show improved performance compared to traditional sparse CCA.
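For readers unfamiliar with the eigenvector view of CCA, the following is a minimal sketch of the classical, low-dimensional case only. It is not the sparse algorithm proposed in this work. It uses the standard fact that the squared canonical correlations are the eigenvalues of $\Sigma_{xx}^{-1/2}\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\Sigma_{xx}^{-1/2}$, so all canonical pairs come out of a single eigendecomposition rather than one at a time. The function name `classical_cca_eig`, the `ridge` stabilizer, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def classical_cca_eig(X, Y, n_pairs=2, ridge=1e-8):
    """Classical CCA via one symmetric eigenvector problem (illustrative only).

    Assumes n_samples > n_features in both X and Y; `ridge` is a small
    hypothetical stabilizer added to the covariance diagonals.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + ridge * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + ridge * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse symmetric square root of a positive definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Sxx_ih, Syy_ih = inv_sqrt(Sxx), inv_sqrt(Syy)
    K = Sxx_ih @ Sxy @ Syy_ih

    # Eigenvalues of K K^T are the squared canonical correlations;
    # its eigenvectors are the x-side directions in whitened coordinates.
    evals, evecs = np.linalg.eigh(K @ K.T)
    order = np.argsort(evals)[::-1][:n_pairs]
    rho = np.sqrt(np.clip(evals[order], 0.0, 1.0))

    U = evecs[:, order]
    A = Sxx_ih @ U                                   # x-side canonical vectors
    B = Syy_ih @ (K.T @ U) / np.maximum(rho, 1e-12)  # y-side canonical vectors
    return A, B, rho

# Simulated example: X and Y share one latent factor z.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=(n, 1))
X = z @ rng.normal(size=(1, 4)) + rng.normal(size=(n, 4))
Y = z @ rng.normal(size=(1, 3)) + rng.normal(size=(n, 3))
A, B, rho = classical_cca_eig(X, Y, n_pairs=2)
```

Because the columns of `A` and `B` are obtained together from one decomposition, the columns are computed independently of each other, which is the property that makes simultaneous (and parallel) estimation natural in an eigenvector-based formulation. The high-dimensional, sparse setting addressed in this work requires additional machinery that this sketch does not attempt to reproduce.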