Genomics data such as RNA gene expression, methylation and microRNA expression are valuable sources of information for various clinical predictive tasks. For example, survival outcomes, cancer histology type and other patient-related information can be predicted using not only clinical data but molecular data as well. Moreover, using these data sources together, for example in multitask learning, can boost performance. However, in practice, many data points are missing, which significantly reduces the number of patients available when analysing full cases, that is, in our setting, patients with all modalities present. In this paper we investigate how imputing missing values using deep learning, coupled with multitask learning, can help reach state-of-the-art performance using the combined genomics modalities: RNA, microRNA and methylation. We propose a generalised deep imputation method to impute values where a patient has all modalities present except one. Interestingly, deep imputation alone outperforms multitask learning alone for the classification and regression tasks across most combinations of modalities. In contrast, when using all modalities for survival prediction, we observe that multitask learning alone outperforms deep imputation alone with statistical significance (adjusted p-value 0.03). Thus, the two approaches are complementary when optimising performance for downstream predictive tasks.
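
To make the distinction between full cases and imputation candidates concrete, the following is a minimal sketch, not the method used in the paper. It assumes a hypothetical availability table in which each patient is marked as having or lacking each of the three modalities; the names `MODALITIES`, `split_cohort` and the patient identifiers are illustrative only. It separates patients with all modalities present from those missing exactly one, which is the case the proposed imputation targets.

```python
# Illustrative modality names (assumed labels, not taken from the paper's code).
MODALITIES = ("rna", "mirna", "methylation")


def split_cohort(availability):
    """Split patients into full cases and single-missing-modality candidates.

    availability maps a patient id to a dict of modality -> bool,
    where True means the modality was measured for that patient.
    """
    complete = []     # patients with every modality present
    candidates = {}   # patient id -> the single missing modality
    for patient, present in availability.items():
        missing = [m for m in MODALITIES if not present.get(m, False)]
        if not missing:
            complete.append(patient)
        elif len(missing) == 1:
            candidates[patient] = missing[0]
        # Patients missing two or more modalities fall outside this setting.
    return complete, candidates


# Hypothetical toy cohort.
availability = {
    "P1": {"rna": True, "mirna": True, "methylation": True},
    "P2": {"rna": True, "mirna": False, "methylation": True},
    "P3": {"rna": False, "mirna": False, "methylation": True},
    "P4": {"rna": True, "mirna": True, "methylation": False},
}

complete, candidates = split_cohort(availability)
print("full cases:", complete)
print("imputation candidates:", candidates)
```

In this toy cohort, restricting the analysis to full cases keeps only one of four patients, whereas imputing the single missing modality would make two more patients usable.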