The analysis of multi-source datasets, in which data on the same objects are collected from multiple sources, is of rising importance in many fields, most notably in multi-omics biology. We propose a novel framework and algorithms for the integrative decomposition of such multi-source data, which identify common factor scores and sort them according to whether they are relevant to all data sources (fully joint), to some data sources (partially joint), or to a single data source. The key difference between our proposal and existing approaches is that we use the raw source-wise factor score subspaces to identify the partially joint, block-wise association structure. To identify common score subspaces, which may be partially joint to only some of the data sources, from noisy observations, our proposed algorithm sequentially computes one-dimensional flag means among the source-wise score subspaces, then collects the subspaces that are close to each mean. The proposed decomposition is computationally fast, and it outperforms competing approaches both in identifying the true partially joint association structure and in recovering the joint loading and score subspaces. We apply the proposed method to a blood cancer multi-omics data set containing measurements from three data sources. We identify a latent score, partially joint to the drug panel and methylation profile data sources but not relevant to the RNA sequencing profiles, that helps discover hidden clusters in the data.
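One step of the flag-mean idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes the one-dimensional flag mean to be the leading left singular vector of the concatenated orthonormal bases, measures closeness by the principal angle between that direction and each source-wise subspace, and uses a hypothetical cutoff `threshold_deg`. The function names, the toy bases, and the cutoff are all invented for illustration; the abstract does not specify the closeness criterion or how the sequential steps are carried out.

```python
import numpy as np


def flag_mean_1d(bases):
    """One-dimensional flag mean of subspaces given by orthonormal bases.

    Assumption: computed as the leading left singular vector of [U_1 ... U_K].
    """
    stacked = np.hstack(bases)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return U[:, 0]


def angle_to_subspace(v, basis):
    """Principal angle in degrees between unit vector v and span(basis)."""
    cosine = np.linalg.norm(basis.T @ v)
    return float(np.degrees(np.arccos(np.clip(cosine, 0.0, 1.0))))


def close_sources(bases, threshold_deg=20.0):
    """Return the mean direction, per-source angles, and sources within the cutoff.

    threshold_deg is a hypothetical tuning parameter, not taken from the paper.
    """
    mean_dir = flag_mean_1d(bases)
    angles = [angle_to_subspace(mean_dir, U) for U in bases]
    members = [k for k, a in enumerate(angles) if a <= threshold_deg]
    return mean_dir, angles, members


# Toy example: three source-wise score subspaces in R^6.
# Sources 0 and 1 share one direction; source 2 is orthogonal to both.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))  # random rotation of the axes
E = np.eye(6)
bases = [Q @ E[:, [0, 1]], Q @ E[:, [0, 2]], Q @ E[:, [3, 4]]]

mean_dir, angles, members = close_sources(bases)
print(members)  # sources 0 and 1 are selected in this toy setup
```

In this toy setup the recovered direction is partially joint to sources 0 and 1 only. A sequential version would presumably remove the identified direction from the selected subspaces and repeat, but that step is not sketched here because the abstract does not describe it.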