The last decade of machine learning has seen drastic increases in scale and capabilities, and deep neural networks (DNNs) are increasingly being deployed in the real world. However, they are difficult to analyze, raising concerns about using them without a rigorous understanding of how they function. Effective tools for interpreting them would help build more trustworthy AI by identifying problems, fixing bugs, and improving basic understanding. In particular, "inner" interpretability techniques, which focus on explaining the internal components of DNNs, are well suited to developing a mechanistic understanding, guiding manual modifications, and reverse engineering solutions. Much recent work has focused on DNN interpretability, but progress has been rapid enough that a thorough systematization of methods has so far been difficult. In this survey, we review over 300 works with a focus on inner interpretability tools. We introduce a taxonomy that classifies methods by what part of the network they help to explain (weights, neurons, subnetworks, or latent representations) and by whether they are implemented during training (intrinsic) or afterward (post hoc). To our knowledge, we are also the first to survey a number of connections between interpretability research and work in adversarial robustness, continual learning, modularity, network compression, and the study of the human visual system. We discuss key challenges and argue that the status quo in interpretability research is largely unproductive. Finally, we highlight the importance of future work that emphasizes diagnostics, debugging, adversaries, and benchmarking in order to make interpretability tools more useful to engineers in practical applications.
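To make the taxonomy's two axes concrete, the sketch below illustrates one cell of it: a post hoc, neuron-level method (activation maximization, a common feature-visualization technique). It is a minimal illustration, not any method from the surveyed works; the ToyNet model and the choice of layer and channel are hypothetical stand-ins for a real trained network.

```python
# A minimal sketch of a *post hoc*, *neuron-level* interpretability method:
# gradient ascent on the input to find a pattern that maximally activates one
# unit. ToyNet, the layer, and channel index are illustrative assumptions.
import torch
import torch.nn as nn

class ToyNet(nn.Module):
    """Stand-in for a trained DNN; in practice, load a real trained model."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, 3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, 3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.conv2(self.act(self.conv1(x)))

model = ToyNet().eval()
for p in model.parameters():
    p.requires_grad_(False)  # post hoc: the trained network stays frozen

# Capture the activations of the layer under study with a forward hook.
activation = {}
model.conv2.register_forward_hook(
    lambda module, inputs, output: activation.update(conv2=output)
)

# Gradient ascent on the input to maximize channel 5's mean activation.
x = torch.randn(1, 3, 32, 32, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.05)
for step in range(200):
    opt.zero_grad()
    model(x)
    loss = -activation["conv2"][0, 5].mean()  # minimize the negative
    loss.backward()
    opt.step()

# x now approximates an input that strongly drives that unit, giving one
# crude window into what the neuron responds to.
```

An intrinsic, neuron-level counterpart would instead constrain the unit during training (e.g., with a sparsity or disentanglement penalty) so that it is interpretable by construction rather than probed after the fact.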