Explaining how important each input feature is to a classifier's decision is critical in high-stakes applications. An underlying principle behind dozens of explanation methods is to take the prediction difference before and after an input feature (here, a token) is removed as that feature's attribution, i.e., its individual treatment effect in causal inference. A recent method called Input Marginalization (IM) (Kim et al., 2020) uses BERT to replace a token, i.e., to simulate the do(.) operator, yielding more plausible counterfactuals. However, our rigorous evaluation using five metrics on three datasets found IM explanations to be consistently more biased, less accurate, and less plausible than those derived from simply deleting a word.
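To make the contrast concrete, below is a minimal sketch (not the authors' code) of the two attribution schemes the abstract compares: leave-one-out deletion versus an IM-style marginalization over BERT's replacement candidates. The model names, the `positive_prob` helper, and the top-k truncation of the candidate set are illustrative assumptions; the original IM formulation marginalizes over the full vocabulary and works with log-odds rather than raw probabilities.

```python
# Sketch of leave-one-out (LOO) vs. Input-Marginalization-style attribution.
# Assumes Hugging Face `transformers`; model choices are illustrative.
from transformers import pipeline

clf = pipeline("sentiment-analysis")                    # any text classifier
mlm = pipeline("fill-mask", model="bert-base-uncased")  # BERT proposes replacements

def positive_prob(text):
    """Probability the classifier assigns to the POSITIVE class."""
    out = clf(text)[0]
    return out["score"] if out["label"] == "POSITIVE" else 1.0 - out["score"]

def loo_attribution(tokens, i):
    """LOO: prediction change when token i is simply deleted."""
    full = positive_prob(" ".join(tokens))
    ablated = positive_prob(" ".join(tokens[:i] + tokens[i + 1:]))
    return full - ablated

def im_attribution(tokens, i, top_k=5):
    """IM-style (roughly following Kim et al., 2020): marginalize the
    prediction over BERT's top replacement candidates for token i."""
    full = positive_prob(" ".join(tokens))
    masked = " ".join(tokens[:i] + [mlm.tokenizer.mask_token] + tokens[i + 1:])
    candidates = mlm(masked, top_k=top_k)
    total = sum(c["score"] for c in candidates)
    marginal = sum(c["score"] * positive_prob(c["sequence"])
                   for c in candidates) / total
    return full - marginal

tokens = "the movie was surprisingly good".split()
for i, tok in enumerate(tokens):
    print(f"{tok:>12}  LOO={loo_attribution(tokens, i):+.3f}"
          f"  IM={im_attribution(tokens, i):+.3f}")
```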