International initiatives such as METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) have collected several multigenomic and clinical data sets to identify the molecular processes taking place throughout the evolution of various cancers. Numerous Machine Learning and statistical models have been designed and trained to analyze these types of data independently; however, the integration of such differently shaped and sourced information streams has not been extensively studied. Better integrating these data sets and generating meaningful representations that can ultimately be leveraged for cancer detection tasks could lead to patients receiving well-suited treatments. Hence, we propose a novel learning pipeline comprising three steps: the integration of cancer data modalities as graphs, followed by the application of Graph Neural Networks in an unsupervised setting to generate lower-dimensional embeddings from the combined data, and finally feeding the new representations into a cancer sub-type classification model for evaluation. Because METABRIC does not store relationships between the patient modalities, the graph construction algorithms are described in depth, together with a discussion of their influence on the quality of the generated embeddings. We also present the models used to generate the lower-dimensional latent-space representations: Graph Neural Networks, Variational Graph Autoencoders and Deep Graph Infomax. In parallel, the pipeline is tested on a synthetic dataset to demonstrate that the characteristics of the underlying data, such as homophily levels, greatly influence the performance of the pipeline, which ranges between 51\% and 98\% accuracy on artificial data, and between 13\% and 80\% on METABRIC. This project has the potential to improve cancer data understanding and encourages the transition of regular data sets to graph-shaped data.
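To make the first pipeline step and the notion of homophily concrete, the following is a minimal sketch, not the construction algorithm used in this work. It assumes that two modalities are simply concatenated per patient and that patients are linked by a k-nearest-neighbour rule on Euclidean distance. The function names `knn_edges` and `edge_homophily`, the toy data, and the choice `k=4` are all hypothetical. Edge homophily is computed as the fraction of edges whose endpoints share a class label.

```python
import numpy as np


def knn_edges(features, k):
    """Return a directed edge list (i, j) linking each row to its k nearest rows."""
    diffs = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a node is never its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    sources = np.repeat(np.arange(features.shape[0]), k)
    return np.stack([sources, neighbours.ravel()], axis=1)


def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    return float(np.mean(labels[edges[:, 0]] == labels[edges[:, 1]]))


# Toy stand-in for two patient modalities (assumption: not METABRIC data).
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
modality_a = rng.normal(loc=labels[:, None] * 3.0, scale=1.0, size=(40, 5))
modality_b = rng.normal(loc=labels[:, None] * 3.0, scale=1.0, size=(40, 3))

# Assumed integration rule: concatenate modalities, then connect similar patients.
features = np.concatenate([modality_a, modality_b], axis=1)
edges = knn_edges(features, k=4)
homophily = edge_homophily(edges, labels)
print(f"{edges.shape[0]} edges, edge homophily = {homophily:.2f}")
```

Reducing the separation between the class means in the toy data lowers the homophily of the resulting graph, which gives a simple way to probe how sensitive a downstream embedding model is to this property.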