While many methods purport to explain predictions by highlighting salient features, what precise aims these explanations serve and how to evaluate their utility are often unstated. In this work, we formalize the value of explanations using a student-teacher paradigm that measures how much explanations improve a student model's ability to simulate the teacher model on unseen examples for which explanations are unavailable. Student models incorporate explanations in training (but not prediction) procedures. Unlike many prior proposals for evaluating explanations, our approach cannot be easily gamed, enabling principled, scalable, and automatic evaluation of attributions. Using our framework, we compare multiple attribution methods and observe consistent, quantitative differences among them across multiple learning strategies.
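To make the protocol concrete, here is a minimal sketch of the student-teacher evaluation loop. Everything specific in it is an assumption for illustration only: a scikit-learn logistic-regression student, synthetic data, teacher coefficients standing in for an attribution method, and feature masking as the explanation-aware learning strategy (the paper's actual attribution methods and learning strategies are not reproduced here).

```python
# Hypothetical sketch of the student-teacher evaluation protocol:
# train students to simulate a teacher, with vs. without explanations,
# and compare simulation accuracy on held-out examples that come
# with no explanations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic task: 1000 examples, 20 features, only the first 3 matter.
X = rng.normal(size=(1000, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 1.0]
y = (X @ true_w + rng.normal(scale=0.5, size=1000)) > 0

# Teacher: a fixed model whose predictions the students must simulate.
teacher = LogisticRegression().fit(X, y)
teacher_labels = teacher.predict(X)

# "Explanations" for the training split only: here the teacher's own
# coefficient magnitudes stand in for a saliency/attribution method.
saliency = np.abs(teacher.coef_).ravel()
top_k = np.argsort(saliency)[-3:]  # features the explanation highlights

X_train, y_train = X[:500], teacher_labels[:500]
X_test, y_test = X[500:], teacher_labels[500:]  # no explanations here

def simulation_accuracy(student):
    """Agreement between student and teacher on unseen examples."""
    return (student.predict(X_test) == y_test).mean()

# Student WITHOUT explanations: plain distillation on teacher labels.
plain = LogisticRegression().fit(X_train, y_train)

# Student WITH explanations (training only): one simple strategy is to
# zero out features the explanation marks as irrelevant during training.
mask = np.zeros(20)
mask[top_k] = 1.0
guided = LogisticRegression().fit(X_train * mask, y_train)
# At prediction time the student sees raw inputs; since the masked-out
# features carried no signal during training, their learned weights are
# ~0, so the explanation influences training but not the test interface.

print(f"simulation acc, no explanations:   {simulation_accuracy(plain):.3f}")
print(f"simulation acc, with explanations: {simulation_accuracy(guided):.3f}")
```

The gap between the two printed accuracies is the sketch's analogue of the explanation's measured value; because the student only ever uses explanations at training time, an attribution method cannot "game" the metric by leaking the teacher's test-time predictions directly.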