Recent years have witnessed the emergence of a variety of post-hoc interpretations that aim to uncover how natural language processing (NLP) models make predictions. Despite the surge of new interpretations, it remains an open problem how to define and quantitatively measure the faithfulness of interpretations, i.e., to what extent they conform to the reasoning process behind the model. To tackle these issues, we start with three criteria: the removal-based criterion, the sensitivity of interpretations, and the stability of interpretations, which quantify different notions of faithfulness, and we propose novel paradigms to systematically evaluate interpretations in NLP. Our results show that the performance of interpretations under different criteria of faithfulness can vary substantially. Motivated by the desiderata behind these faithfulness notions, we introduce a new class of interpretation methods that adopt techniques from the adversarial robustness domain. Empirical results show that our proposed methods achieve top performance under all three criteria. Through experiments and analysis on both text classification and dependency parsing tasks, we arrive at a more comprehensive understanding of the diverse set of interpretations.