The importance of explainability is increasingly acknowledged in natural language processing. However, it is still unclear how the quality of explanations can be assessed effectively. The predominant approach is to compare proxy scores (such as BLEU or explanation F1) evaluated against gold explanations in the dataset. The underlying assumption is that an increase in the proxy score implies greater utility of the explanations to users. In this paper, we question this assumption. In particular, we (i) formulate desired characteristics of explanation quality that apply across tasks and domains, (ii) point out how current evaluation practices violate those characteristics, and (iii) propose actionable guidelines to overcome obstacles that limit today's evaluation of explanation quality and to enable the development of explainable systems that provide tangible benefits for human users. We substantiate our theoretical claims (i.e., the lack of validity and the temporal decline of currently used proxy scores) with empirical evidence from a crowdsourcing case study in which we investigate the explanation quality of state-of-the-art explainable question answering systems.
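To make the criticized evaluation practice concrete, the following is a minimal sketch of how such proxy scores are typically computed: a generated explanation is compared against a single gold explanation from the dataset using BLEU and token-level F1. The example texts and the helper `token_f1` are illustrative assumptions, not the evaluation code used in this paper.

```python
# Sketch of proxy-score evaluation: compare a generated explanation to a gold
# explanation with BLEU and SQuAD-style token F1. Texts are made up.
from collections import Counter

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold explanation."""
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold = "the author moved to Berlin because of a new job"
generated = "the author relocated to Berlin for a new job"

bleu = sentence_bleu(
    [gold.split()],                # list of reference token lists
    generated.split(),             # hypothesis tokens
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {bleu:.3f}, token F1: {token_f1(generated, gold):.3f}")
```

Scores like these capture surface overlap with a reference text; they say nothing about whether the explanation actually helps a user, which is precisely the gap this paper examines.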