Attention mechanisms have come to dominate the explainability of deep models. They produce probability distributions over the input, which are widely deemed feature-importance indicators. However, in this paper, we find one critical limitation in attention explanations: a weakness in identifying the polarity of feature impact. This can be misleading: features with higher attention weights may not faithfully contribute to model predictions; instead, they can impose suppression effects. With this finding, we reflect on the explainability of current attention-based techniques, such as Attention$\odot$Gradient and LRP-based attention explanations. We first propose an actionable diagnostic methodology (henceforth, the faithfulness violation test) to measure the consistency between explanation weights and impact polarity. Through extensive experiments, we then show that most tested explanation methods are unexpectedly hindered by the faithfulness violation issue, especially raw attention. Empirical analyses of the factors affecting violations further provide useful observations for adopting explanation methods in attention models.
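To make the diagnostic idea concrete, the following is a minimal sketch, not the paper's exact protocol: a feature's impact polarity is estimated by masking it and observing the change in the model's score, and a faithfulness violation is flagged whenever that polarity disagrees with the sign of the feature's explanation weight. The model `f`, the `baseline` values used for masking, and the helper names `impact_polarity` and `violation_rate` are illustrative assumptions.

```python
import numpy as np

def impact_polarity(f, x, baseline, idx):
    """Estimate the impact polarity of feature `idx`: positive if masking the
    feature lowers the model score (the feature supports the prediction),
    negative if masking raises the score (the feature suppresses it)."""
    x_masked = x.copy()
    x_masked[idx] = baseline[idx]
    return np.sign(f(x) - f(x_masked))

def violation_rate(f, x, baseline, weights):
    """Fraction of features whose explanation weight sign disagrees with the
    measured impact polarity, i.e. a simple faithfulness violation score.
    Raw attention weights are non-negative, so any feature with a negative
    (suppressive) impact counts as a violation under this check."""
    violations = sum(
        np.sign(w) != impact_polarity(f, x, baseline, idx)
        for idx, w in enumerate(weights)
    )
    return violations / len(weights)
```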