High-throughput technologies such as next-generation sequencing allow biologists to observe cell function with unprecedented resolution, but the resulting datasets are too large and complicated for humans to understand without the aid of advanced statistical methods. Machine learning (ML) algorithms, which are designed to automatically find patterns in data, are well suited to this task. Yet these models are often so complex as to be opaque, leaving researchers with few clues about underlying mechanisms. Interpretable machine learning (iML) is a burgeoning subdiscipline of computational statistics devoted to making the predictions of ML models more intelligible to end users. This article is a gentle and critical introduction to iML, with an emphasis on genomic applications. I define relevant concepts, motivate leading methodologies, and provide a simple typology of existing approaches. I survey recent examples of iML in genomics, demonstrating how such techniques are increasingly integrated into research workflows. I argue that iML solutions are required to realize the promise of precision medicine. However, several open challenges remain. I examine the limitations of current state-of-the-art tools and propose a number of directions for future research. While the horizon for iML in genomics is wide and bright, continued progress requires close collaboration across disciplines.