Data integration, or the strategic analysis of multiple sources of data simultaneously, can often lead to discoveries that remain hidden when each data source is analyzed on its own. We develop a new unsupervised data integration method named Integrated Principal Components Analysis (iPCA), a model-based generalization of PCA that serves as a practical tool for finding and visualizing common patterns across multiple data sets. The key idea driving iPCA is the matrix-variate normal model, whose Kronecker product covariance structure captures both individual patterns within each data set and joint patterns shared by multiple data sets. Building upon this model, we develop several penalized (sparse and non-sparse) covariance estimators for iPCA, and using geodesic convexity, we prove that our non-sparse iPCA estimator converges to the global solution of a non-convex problem. We also demonstrate the practical advantages of iPCA through extensive simulations and a case study in integrative genomics for Alzheimer's disease. In particular, we show that the joint patterns extracted via iPCA are highly predictive of a patient's cognition and Alzheimer's diagnosis.
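To make the Kronecker product covariance structure concrete, the following is a minimal simulation sketch in Python with NumPy. It is not the penalized iPCA estimator described above. It assumes, for illustration only, that each data set X_k (n samples by p_k features) is a mean-zero matrix-variate normal whose flattened covariance is the Kronecker product of a row covariance `sigma` shared by all data sets and a column covariance `deltas[k]` specific to data set k, and that the joint patterns correspond to the leading eigenvectors of the shared `sigma`. The helper names `ar1_cov`, `sample_matrix_normal`, and `joint_scores`, along with the AR(1) covariances and dimensions, are hypothetical choices made for this example.

```python
import numpy as np


def ar1_cov(dim, rho):
    """AR(1) covariance with entries rho**|i-j| (positive definite for |rho| < 1)."""
    idx = np.arange(dim)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def sample_matrix_normal(rng, row_cov, col_cov, size):
    """Draw `size` mean-zero matrices X = A Z B^T with i.i.d. standard normal Z.

    With row-major flattening, Cov(X.ravel()) = kron(row_cov, col_cov).
    """
    a = np.linalg.cholesky(row_cov)  # row_cov = a @ a.T
    b = np.linalg.cholesky(col_cov)  # col_cov = b @ b.T
    z = rng.standard_normal((size, row_cov.shape[0], col_cov.shape[0]))
    return a @ z @ b.T  # matmul broadcasts over the leading `size` axis


def joint_scores(row_cov, n_components):
    """Leading eigenvectors of the shared row covariance, largest eigenvalue first."""
    eigvals, eigvecs = np.linalg.eigh(row_cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]


rng = np.random.default_rng(0)
n = 4                                            # samples shared by all data sets
sigma = ar1_cov(n, 0.6)                          # assumed shared (joint) row covariance
deltas = [ar1_cov(3, 0.3), ar1_cov(2, -0.5)]     # assumed per-data-set column covariances

for k, delta in enumerate(deltas):
    draws = sample_matrix_normal(rng, sigma, delta, size=20000)
    flat = draws.reshape(len(draws), -1)         # row-major vec of each draw
    empirical = flat.T @ flat / len(flat)        # mean is zero, so no centering
    err = np.max(np.abs(empirical - np.kron(sigma, delta)))
    print(f"data set {k}: max abs deviation from Kronecker covariance = {err:.3f}")

scores = joint_scores(sigma, n_components=2)     # n x 2 matrix of joint patterns
print(scores.shape)
```

In this sketch the true `sigma` is known, so the joint patterns are read off directly. In practice `sigma` and each `deltas[k]` would have to be estimated from the observed data sets, which is the role of the penalized covariance estimators.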