While much research has focused on producing explanations, it remains unclear how the quality of those explanations can be evaluated in a meaningful way. Today's predominant approach is to quantify explanations via proxy scores that compare them to (human-annotated) gold explanations. This approach assumes that explanations which achieve higher proxy scores also provide a greater benefit to human users. In this paper, we present problems with this approach. Concretely, we (i) formulate desired characteristics of explanation quality, (ii) describe how current evaluation practices violate them, and (iii) support our argumentation with initial evidence from a crowdsourcing case study in which we investigate the explanation quality of state-of-the-art explainable question answering systems. We find that proxy scores correlate poorly with human quality ratings and, moreover, become less expressive the more often they are used (i.e., following Goodhart's law). Finally, we propose guidelines to enable a meaningful evaluation of explanations and to drive the development of systems that provide tangible benefits to human users.