The rapid development of high-throughput technologies has enabled the generation of data from biological or disease processes that span multiple layers, such as genomic, proteomic, or metabolomic data, and that further pertain to multiple sources, such as disease subtypes or experimental conditions. In this work, we propose a general statistical framework based on Gaussian graphical models for horizontal (i.e., across conditions or subtypes) and vertical (i.e., across different layers containing data on molecular compartments) integration of information in such datasets. We start by decomposing the multi-layer problem into a series of two-layer problems. For each two-layer problem, we model the outcomes at a node in the lower layer as dependent on those of other nodes in that layer, as well as on all nodes in the upper layer. We use a combination of neighborhood selection and group-penalized regression to obtain sparse estimates of all model parameters. Following this, we develop a debiasing technique for the inter-layer directed edge weights and derive their asymptotic distributions, reusing the neighborhood selection coefficients already computed for nodes in the upper layer. Subsequently, we establish global and simultaneous testing procedures for these edge weights. The performance of the proposed methodology is evaluated on synthetic and real data.
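As a rough illustration of the neighborhood selection step for a single two-layer problem, the sketch below regresses each lower-layer node on the remaining lower-layer nodes and on all upper-layer nodes under an L1 penalty. It is a simplification under stated assumptions and not the proposed estimator: it handles a single condition, uses a plain lasso in place of the group penalty, and omits debiasing and testing. All names (`lasso_cd`, `two_layer_neighborhoods`, `lam`) and the simulated data are hypothetical.

```python
import numpy as np


def soft_threshold(z, t):
    """Scalar soft-thresholding operator used by the lasso update."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(A, b, lam, n_iter=200):
    """Minimize (1/2n)||b - A w||^2 + lam * ||w||_1 by coordinate descent."""
    n, p = A.shape
    w = np.zeros(p)
    col_sq = (A ** 2).sum(axis=0) / n
    r = b.astype(float).copy()  # running residual b - A w
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (w_j removed).
            rho = A[:, j] @ r / n + col_sq[j] * w[j]
            w_new = soft_threshold(rho, lam) / col_sq[j]
            r += A[:, j] * (w[j] - w_new)
            w[j] = w_new
    return w


def two_layer_neighborhoods(X, Y, lam):
    """Node-wise lasso for one two-layer problem (assumed simplification).

    X: (n, p) upper-layer data; Y: (n, q) lower-layer data.
    Returns B (p, q), the upper-to-lower directed coefficients, and
    Theta (q, q), the within-lower-layer neighborhood coefficients.
    """
    p, q = X.shape[1], Y.shape[1]
    B = np.zeros((p, q))
    Theta = np.zeros((q, q))
    for j in range(q):
        others = [k for k in range(q) if k != j]
        design = np.hstack([Y[:, others], X])
        w = lasso_cd(design, Y[:, j], lam)
        Theta[j, others] = w[: q - 1]
        B[:, j] = w[q - 1:]
    return B, Theta


# Toy data (hypothetical): 3 upper-layer nodes, 2 lower-layer nodes.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
B_true = np.zeros((3, 2))
B_true[0, 0] = 2.0
B_true[2, 1] = -1.5
Y = X @ B_true + 0.5 * rng.standard_normal((200, 2))

B_hat, Theta_hat = two_layer_neighborhoods(X, Y, lam=0.1)
print(np.round(B_hat, 2))
```

In the toy run, the large entries of `B_hat` should line up with the nonzero entries of `B_true`, shrunk slightly toward zero by the penalty. That shrinkage is the bias a debiasing step is meant to correct before inference on the edge weights.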