Analyzing multi-source data, that is, multiple views of data on the same subjects, has become increasingly common in molecular biomedical research. Some recent methods seek to uncover underlying structure and relationships within and/or between the data sources, while others seek to build a predictive model for an outcome using all sources. However, existing methods that do both are limited because they either (1) consider only the structure shared by all datasets and ignore structure unique to each source, or (2) extract underlying structures first without considering the outcome. We propose a method called supervised joint and individual variation explained (sJIVE) that simultaneously (1) identifies shared (joint) and source-specific (individual) underlying structure and (2) builds a linear prediction model for an outcome using these structures. The two components are weighted to balance explaining variation in the multi-source data against explaining variation in the outcome. Simulations show that sJIVE outperforms existing methods when large amounts of noise are present in the multi-source data. An application to data from the COPDGene study reveals gene expression and proteomic patterns that are predictive of lung function. Functions to perform sJIVE are included in the R.JIVE package, available online at http://github.com/lockEF/r.jive .
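
To make the weighting between the two components concrete, the following is a sketch of what such an objective could look like. All notation here is assumed for illustration and does not appear in the abstract: $X_i$ is the data matrix for source $i = 1, \dots, k$ (features by subjects), $y$ is the outcome vector, $S_J$ and $S_i$ are joint and individual score matrices, $U_i$ and $W_i$ are the corresponding loadings, $\theta_1$ and $\theta_{2i}$ are regression coefficients, and $\eta \in (0, 1)$ is the weight.

```latex
% Illustrative weighted objective (notation assumed, not taken from the abstract).
% First term: reconstruction error of each data source from joint + individual structure.
% Second term: prediction error of a linear model for y built on the same scores.
\min_{U_i,\, W_i,\, S_J,\, S_i,\, \theta_1,\, \theta_{2i}} \;
  \eta \sum_{i=1}^{k} \bigl\lVert X_i - U_i S_J - W_i S_i \bigr\rVert_F^2
  \;+\; (1 - \eta) \Bigl\lVert y - \theta_1 S_J - \sum_{i=1}^{k} \theta_{2i} S_i \Bigr\rVert_2^2
```

Under this reading, $\eta$ close to 1 prioritizes explaining variation in the multi-source data, while $\eta$ close to 0 prioritizes explaining variation in the outcome; how the weight is chosen in practice is not stated in the abstract.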