Along with the great success of deep neural networks, there is also growing concern about their black-box nature. The interpretability issue affects people's trust in deep learning systems. It is also related to many ethical problems, e.g., algorithmic discrimination. Moreover, interpretability is a desired property for deep networks to become powerful tools in other research fields, e.g., drug discovery and genomics. In this survey, we conduct a comprehensive review of neural network interpretability research. We first clarify the definition of interpretability, as it has been used in many different contexts. We then elaborate on the importance of interpretability and propose a novel taxonomy organized along three dimensions: the type of engagement (passive vs. active interpretation approaches), the type of explanation, and the focus (from local to global interpretability). This taxonomy provides a meaningful 3D view of the distribution of papers in the relevant literature, as two of the dimensions are not simply categorical but allow ordinal subcategories. Finally, we summarize the existing interpretability evaluation methods and suggest possible research directions inspired by our new taxonomy.