The last decade of machine learning has seen drastic increases in scale and capabilities. Deep neural networks (DNNs) are increasingly being deployed in the real world. However, they are generally difficult to analyze, raising concerns about using them without a rigorous understanding of how they function. Effective tools for interpreting them will be important for building more trustworthy AI by helping to identify failures, fix bugs, and improve basic understanding. In particular, "inner" interpretability techniques, which focus on explaining the internal components of DNNs, are well-suited for developing a mechanistic understanding, guiding manual modifications, and reverse engineering solutions. Much recent work has focused on DNN interpretability, and rapid progress has thus far made a thorough systematization of methods difficult. In this survey, we review over 300 works with a focus on inner interpretability tools. We introduce a taxonomy that classifies methods by what part of the network they help to explain (weights, neurons, subnetworks, or latent representations) and whether they are implemented during (intrinsic) or after (post hoc) training. To our knowledge, we are also the first to survey a number of connections between interpretability research and work in adversarial robustness, continual learning, modularity, network compression, and studying the human visual system. Finally, we discuss key challenges and argue for future work emphasizing diagnostics, benchmarking, and robustness.