Identifying the genes and mutations that drive the emergence of tumors is a major step toward improving our understanding of cancer and identifying new directions for disease diagnosis and treatment. Despite the large volume of genomics data, precisely detecting driver mutations and the genes that carry them, known as cancer driver genes, among the millions of possible somatic mutations remains a challenge. Computational methods play an increasingly important role in identifying genomic patterns associated with cancer drivers and in developing models to predict driver events. Machine learning (ML) has been the engine behind many of these efforts and offers excellent opportunities for closing the remaining gaps in the field. This survey therefore presents a comprehensive analysis of ML-based computational approaches for identifying cancer driver mutations and genes, providing an integrated, panoramic view of the broad data and algorithmic landscape of this scientific problem. We discuss how the interactions among data types and ML algorithms have been explored in previous solutions and outline current analytical limitations that deserve further attention from the scientific community. We hope that, by helping readers become more familiar with the significant developments that ML has brought to the field, we may inspire new researchers to address open problems and advance our knowledge of cancer driver discovery.