The availability of multi-omics data has revolutionized the life sciences by creating avenues for integrated system-level approaches. Data integration links information across datasets to better understand the underlying biological processes. However, high dimensionality, correlations, and heterogeneity pose statistical and computational challenges. We propose a general framework, probabilistic two-way partial least squares (PO2PLS), which addresses these challenges. PO2PLS models the relationship between two datasets using joint and data-specific latent variables. For maximum likelihood estimation of the parameters, we implement a fast EM algorithm and show that the estimator is asymptotically normally distributed. We propose a global test of the relationship between the two datasets and derive its asymptotic distribution. Notably, several existing omics integration methods are special cases of PO2PLS. Via extensive simulations, we show that PO2PLS outperforms alternatives in feature selection and prediction. In addition, the asymptotic distribution appears to hold when the sample size is sufficiently large. We illustrate PO2PLS with two examples from commonly used study designs: a large population cohort and a small case-control study. Besides recovering known relationships, PO2PLS also yielded novel findings. The methods are implemented in our R package PO2PLS. Supplementary materials for this article are available online.
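To make the idea of joint and data-specific latent variables concrete, the following is a minimal simulation sketch in plain Python. It is not the PO2PLS implementation and does not perform EM estimation; it only draws two data blocks that share one joint latent variable and each carry one block-specific latent variable. The function names (`simulate_two_blocks`, `unit`, `correlation`), the dimensions, the loading patterns, and all parameter values are assumptions chosen for illustration, and the actual PO2PLS parametrization may differ.

```python
import math
import random


def unit(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def simulate_two_blocks(n=200, p=6, q=4, b=0.9, noise_sd=0.1, seed=1):
    """Draw (X, Y) from a one-component joint + specific latent variable model.

    Hypothetical illustration only: names, dimensions, and parameter values
    are assumptions, not taken from the PO2PLS package.
    """
    rng = random.Random(seed)
    # Joint and specific loadings have disjoint support, hence are orthogonal.
    w = unit([1.0] * (p // 2) + [0.0] * (p - p // 2))
    w_o = unit([0.0] * (p // 2) + [1.0] * (p - p // 2))
    c = unit([1.0] * (q // 2) + [0.0] * (q - q // 2))
    c_o = unit([0.0] * (q // 2) + [1.0] * (q - q // 2))

    X, Y, T, U = [], [], [], []
    for _ in range(n):
        t = rng.gauss(0, 1)            # joint latent variable of X
        u = b * t + rng.gauss(0, 0.3)  # joint latent variable of Y, linked to t
        t_o = rng.gauss(0, 1)          # X-specific latent variable
        u_o = rng.gauss(0, 1)          # Y-specific latent variable
        X.append([t * w[j] + t_o * w_o[j] + rng.gauss(0, noise_sd) for j in range(p)])
        Y.append([u * c[k] + u_o * c_o[k] + rng.gauss(0, noise_sd) for k in range(q)])
        T.append(t)
        U.append(u)
    return X, Y, T, U


def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))


X, Y, T, U = simulate_two_blocks()
# Compare a feature pair loaded on the joint parts with a pair loaded on the specific parts.
joint_corr = correlation([row[0] for row in X], [row[0] for row in Y])
specific_corr = correlation([row[-1] for row in X], [row[-1] for row in Y])
print(f"joint features: {joint_corr:.2f}, specific features: {specific_corr:.2f}")
```

In this sketch, features loaded on the joint latent variables are strongly correlated across the two blocks, while features loaded only on the specific latent variables are close to uncorrelated. Separating these two sources of variation is the task an integration method of this kind has to solve from the observed data alone.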