The last decade of machine learning has seen drastic increases in scale and capabilities, and deep neural networks (DNNs) are increasingly being deployed across a wide range of domains. However, the inner workings of DNNs are generally difficult to understand, raising concerns about the safety of using these systems without a rigorous understanding of how they function. In this survey, we review literature on techniques for interpreting the inner components of DNNs, which we call "inner" interpretability methods. Specifically, we review methods for interpreting weights, neurons, subnetworks, and latent representations with a focus on how these techniques relate to the goal of designing safer, more trustworthy AI systems. We also highlight connections between interpretability and work in modularity, adversarial robustness, continual learning, network compression, and studying the human visual system. Finally, we discuss key challenges and argue for future work in interpretability for AI safety that focuses on diagnostics, benchmarking, and robustness.