When a model attribution technique highlights a particular part of the input, a user might understand this highlight as making a statement about counterfactuals (Miller, 2019): if that part of the input were to change, the model's prediction might change as well. This paper investigates how well different attribution techniques align with this assumption on realistic counterfactuals in the case of reading comprehension (RC). RC is a particularly challenging test case, as token-level attributions that have been extensively studied in other NLP tasks such as sentiment analysis are less suitable to represent the reasoning that RC models perform. We construct counterfactual sets for three different RC settings, and through heuristics that connect attribution methods' outputs to high-level model behavior, we evaluate how useful different attribution methods, and even different attribution formats, are for understanding counterfactuals. We find that pairwise attributions are better suited to RC than token-level attributions across these different RC settings, with our best performance coming from a modification that we propose to an existing pairwise attribution method.