End-to-end neural Natural Language Processing (NLP) models are notoriously difficult to understand. This has given rise to numerous efforts towards model explainability in recent years. One desideratum of model explanation is faithfulness, i.e. an explanation should accurately represent the reasoning process behind the model's prediction. In this survey, we review over 110 model explanation methods in NLP through the lens of faithfulness. We first discuss the definition and evaluation of faithfulness, as well as its significance for explainability. We then introduce recent advances in faithful explanation, grouping existing approaches into five categories: similarity methods, analysis of model-internal structures, backpropagation-based methods, counterfactual intervention, and self-explanatory models. For each category, we synthesize its representative studies, strengths, and weaknesses. Finally, we summarize their common virtues and remaining challenges, and reflect on future work directions towards faithful explainability in NLP.