The availability of multi-modality datasets provides a unique opportunity to characterize the same object of interest more comprehensively from multiple viewpoints. In this work, we investigate canonical correlation analysis (CCA) and penalized variants of CCA (pCCA) for the fusion of two modalities. We study a simple graphical model for the generation of two-modality data. We show analytically that, with known model parameters, posterior mean estimators that jointly use both modalities outperform arbitrary linear mixing of single-modality posterior estimators in latent variable prediction. pCCA variants that incorporate domain knowledge can discover correlations in high-dimensional, low-sample-size data, where traditional CCA is inapplicable. To facilitate the generation of multi-dimensional embeddings with pCCA, we propose two matrix deflation schemes that enforce desirable properties exhibited by CCA. Combining all of the above, we propose a two-stage pipeline for latent variable prediction that uses pCCA embeddings generated with deflation. On simulated data, our proposed model drastically reduces the mean-squared error in latent variable prediction. When applied to publicly available histopathology and RNA-sequencing data from breast cancer patients in The Cancer Genome Atlas (TCGA), our model can outperform principal component analysis (PCA) embeddings of the same dimension in survival prediction.
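For readers unfamiliar with CCA, the following is a minimal sketch of classical (unpenalized) CCA on simulated two-modality data. It is not the pCCA or deflation method proposed in this work. The generative setup (one shared Gaussian latent variable `z` with linear loadings and isotropic noise), the function names `cca` and `simulate`, and all parameter values are assumptions chosen for illustration only.

```python
import numpy as np


def cca(X, Y, k=2, ridge=1e-6):
    """Classical CCA via whitening and SVD (illustrative sketch only).

    Returns projection matrices A (p x k), B (q x k) and the top-k
    canonical correlations. The small ridge term is an assumption added
    for numerical stability; it is not a substitute for a pCCA penalty.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + ridge * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + ridge * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations.
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy, full_matrices=False)
    A = Wx @ U[:, :k]
    B = Wy @ Vt[:k].T
    return A, B, s[:k]


def simulate(n=500, p=10, q=8, noise=0.5, seed=0):
    """Hypothetical two-modality data driven by one shared latent z."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    wx = rng.standard_normal(p)  # assumed loadings for modality X
    wy = rng.standard_normal(q)  # assumed loadings for modality Y
    X = np.outer(z, wx) + noise * rng.standard_normal((n, p))
    Y = np.outer(z, wy) + noise * rng.standard_normal((n, q))
    return X, Y, z


X, Y, z = simulate()
A, B, rho = cca(X, Y, k=2)

# First pair of canonical variates (one-dimensional embeddings).
u = (X - X.mean(axis=0)) @ A[:, 0]
v = (Y - Y.mean(axis=0)) @ B[:, 0]
print("canonical correlations:", rho)
print("|corr(u, z)|:", abs(np.corrcoef(u, z)[0, 1]))
```

In this low-dimensional, large-sample setting the sample covariance matrices are well conditioned, so the whitening step is stable. When the number of features exceeds the number of samples, the covariance matrices become singular and this construction breaks down, which is the regime where penalized variants are needed.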