Integrative learning of multiple datasets has the potential to mitigate the challenge of small $n$ and large $p$ that is often encountered in the analysis of big biomedical data such as genomics data. Detection of weak yet important signals can be enhanced by jointly selecting features for all datasets. However, the set of important features may not always be the same across all datasets. Although some existing integrative learning methods allow a heterogeneous sparsity structure, in which a subset of datasets can have zero coefficients for some selected features, they tend to yield reduced efficiency, reintroducing the problem of losing weak important signals. We propose a new integrative learning approach that not only aggregates important signals well under a homogeneous sparsity structure, but also substantially alleviates the problem of losing weak important signals under a heterogeneous sparsity structure. Our approach exploits an a priori known graphical structure of features and encourages joint selection of features that are connected in the graph. Integrating such prior information over multiple datasets enhances the power while also accounting for the heterogeneity across datasets. We investigate theoretical properties of the proposed method. We also demonstrate the limitations of existing approaches and the superiority of our method using a simulation study and an analysis of gene expression data from ADNI.
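To make the idea of graph-guided joint selection concrete, the following is a minimal sketch of one generic way such a penalty could be written. It is an illustrative overlapping-group penalty, not necessarily the penalty proposed in this work. The function name `graph_group_penalty`, the coefficient layout (rows are features, columns are datasets), the edge list, and the tuning parameter `lam` are all assumptions introduced for illustration.

```python
import numpy as np


def graph_group_penalty(B, edges, lam=1.0):
    """Illustrative graph-guided group penalty (hypothetical, not the paper's).

    B     : array of shape (p, M); B[j, m] is the coefficient of
            feature j in dataset m.
    edges : iterable of (j, k) index pairs, one per edge of the
            a priori known feature graph.
    lam   : non-negative tuning parameter.
    """
    B = np.asarray(B, dtype=float)
    total = 0.0
    for j, k in edges:
        # Frobenius norm of the 2 x M block holding both connected
        # features across all datasets. Penalizing the block as a whole
        # encourages the two features to enter or leave the model
        # together, while individual entries may still differ by dataset.
        total += np.linalg.norm(B[[j, k], :])
    return lam * total


# Toy example: 3 features, 2 datasets, chain graph 0 - 1 - 2.
B = np.array([[3.0, 0.0],
              [4.0, 0.0],
              [0.0, 0.0]])
edges = [(0, 1), (1, 2)]
penalty = graph_group_penalty(B, edges, lam=2.0)
```

In this toy example the edge (0, 1) contributes a norm of 5 and the edge (1, 2) contributes 4, so the penalty is 2 × 9 = 18. In an actual estimator such a term would be added to the sum of the per-dataset loss functions and minimized over `B`.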