While there is increasing concern about the interpretability of neural models, evaluating interpretability remains an open problem due to the lack of proper evaluation datasets and metrics. In this paper, we present a novel benchmark to evaluate the interpretability of both neural models and saliency methods. This benchmark covers three representative NLP tasks: sentiment analysis, textual similarity, and reading comprehension, each provided with both English and Chinese annotated data. To precisely evaluate interpretability, we provide token-level rationales that are carefully annotated to be sufficient, compact, and comprehensive. We also design a new metric, i.e., the consistency between the rationales before and after perturbations, to uniformly evaluate the interpretability of models and saliency methods on different tasks. Based on this benchmark, we conduct experiments on three typical models with three saliency methods, and unveil their strengths and weaknesses in terms of interpretability. We will release this benchmark at \url{https://xyz} and hope it can facilitate research on building trustworthy systems.
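The abstract only names the consistency metric without defining it, so the following is a minimal illustrative sketch, not the paper's exact formulation: it extracts a top-$k$ token-level rationale from saliency scores before and after a perturbation and measures their overlap with F1. The function names (\texttt{top\_k\_rationale}, \texttt{consistency\_f1}) and the alignment assumption are our own for illustration.

\begin{verbatim}
# Hedged sketch of rationale consistency under perturbation.
# Assumes the perturbation (e.g., a synonym swap) preserves token
# alignment, so rationale indices are directly comparable.
from typing import List, Set


def top_k_rationale(saliency: List[float], k: int) -> Set[int]:
    """Indices of the k most salient tokens (a token-level rationale)."""
    order = sorted(range(len(saliency)),
                   key=lambda i: saliency[i], reverse=True)
    return set(order[:k])


def consistency_f1(saliency_orig: List[float],
                   saliency_pert: List[float], k: int) -> float:
    """F1 overlap between rationales before and after perturbation."""
    r_orig = top_k_rationale(saliency_orig, k)
    r_pert = top_k_rationale(saliency_pert, k)
    overlap = len(r_orig & r_pert)
    if overlap == 0:
        return 0.0
    precision = overlap / len(r_pert)
    recall = overlap / len(r_orig)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy example: saliency scores over a 6-token sentence,
    # before and after a meaning-preserving perturbation.
    before = [0.05, 0.40, 0.10, 0.30, 0.05, 0.10]
    after = [0.04, 0.38, 0.12, 0.28, 0.08, 0.10]
    print(f"consistency@3 = {consistency_f1(before, after, k=3):.2f}")
\end{verbatim}

A higher score indicates that the model and saliency method point to the same evidence for the original and perturbed inputs, which is the intuition behind using consistency as an interpretability signal.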