While there is increasing concern about the interpretability of neural models, the evaluation of interpretability remains an open problem due to the lack of proper evaluation datasets and metrics. In this paper, we present a novel benchmark to evaluate the interpretability of both neural models and saliency methods. This benchmark covers three representative NLP tasks: sentiment analysis, textual similarity, and reading comprehension, each provided with both English and Chinese annotated data. To precisely evaluate interpretability, we provide token-level rationales that are carefully annotated to be sufficient, compact, and comprehensive. We also design a new metric, i.e., the consistency between the rationales before and after perturbations, to uniformly evaluate interpretability across different types of tasks. Based on this benchmark, we conduct experiments on three typical models with three saliency methods, and unveil their strengths and weaknesses in terms of interpretability. We will release this benchmark at https://www.luge.ai/#/luge/task/taskDetail?taskId=15 and hope it can facilitate research on building trustworthy systems.
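The abstract only sketches the perturbation-based consistency metric. As a rough illustration, the snippet below shows one plausible way to score consistency between the rationale extracted before and after a perturbation, assuming rationales are represented as sets of token indices and consistency is measured as token-level F1 overlap; this is an assumption for illustration, not necessarily the paper's exact formulation.

```python
def rationale_consistency(rationale_before, rationale_after):
    """Token-level F1 overlap between two rationales (sets of token indices).

    Illustrative sketch only: the benchmark's actual consistency metric may
    differ, e.g., it may align tokens between the original and perturbed inputs.
    """
    before, after = set(rationale_before), set(rationale_after)
    if not before and not after:
        return 1.0  # both rationales empty: treat as perfectly consistent
    overlap = len(before & after)
    if overlap == 0:
        return 0.0
    precision = overlap / len(after)
    recall = overlap / len(before)
    return 2 * precision * recall / (precision + recall)


# Example: rationale token indices identified before and after a perturbation.
print(rationale_consistency({1, 2, 5, 7}, {1, 2, 6, 7}))  # 0.75
```

Under this reading, a saliency method is more interpretable when small, meaning-preserving perturbations leave its rationale largely unchanged, yielding a consistency score close to 1.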