While many methods purport to explain predictions by highlighting salient features, what aims these explanations serve and how they ought to be evaluated often go unstated. In this work, we introduce a framework to quantify the value of explanations via the accuracy gains that they confer on a student model trained to simulate a teacher model. Crucially, the explanations are available to the student during training, but are not available at test time. Compared to prior proposals, our approach is less easily gamed, enabling principled, automatic, model-agnostic evaluation of attributions. Using our framework, we compare numerous attribution methods for text classification and question answering, and observe quantitative differences that are moderately to highly consistent across different student model architectures and learning strategies.
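To make the protocol concrete, below is a minimal sketch of the student-teacher simulation setup the abstract describes. Everything beyond the protocol itself is an assumption for illustration: the `SimulationStudent` architecture, the attention-alignment auxiliary loss, and names such as `alpha` are hypothetical choices, not the paper's exact implementation.

```python
# Sketch of the simulation protocol: a student is trained to reproduce a
# teacher's predictions, with attributions available ONLY at training time.
# The specific model and the KL-based explanation loss are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimulationStudent(nn.Module):
    """Bag-of-embeddings classifier with a per-token attention head."""
    def __init__(self, vocab_size: int, dim: int = 64, num_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.attn = nn.Linear(dim, 1)        # per-token salience scores
        self.clf = nn.Linear(dim, num_classes)

    def forward(self, token_ids: torch.Tensor):
        h = self.embed(token_ids)                           # (B, T, D)
        attn = torch.softmax(self.attn(h).squeeze(-1), -1)  # (B, T)
        pooled = torch.einsum("bt,btd->bd", attn, h)
        return self.clf(pooled), attn

def train_step(student, optimizer, token_ids, teacher_preds, explanations,
               alpha: float = 1.0):
    """One training step: fit the teacher's predictions and, at train time
    only, align the student's attention with the given attributions
    (assumed here to be normalized per-token weights)."""
    logits, attn = student(token_ids)
    loss = F.cross_entropy(logits, teacher_preds)
    loss = loss + alpha * F.kl_div(attn.clamp_min(1e-9).log(), explanations,
                                   reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def simulation_accuracy(student, token_ids, teacher_preds):
    """Test-time metric: how often the student matches the teacher,
    with no access to explanations."""
    logits, _ = student(token_ids)
    return (logits.argmax(-1) == teacher_preds).float().mean().item()
```

Under this sketch, the value of an attribution method would be read off as the gain in `simulation_accuracy` for a student trained with `alpha > 0` over an explanation-free baseline trained with `alpha = 0`.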