Despite the popularity of Vision Transformers (ViTs) and eXplainable AI (XAI), only a few explanation methods have been proposed for ViTs thus far. They rely on the attention weights of the classification token over patch embeddings and often produce unsatisfactory saliency maps. In this paper, we propose a novel method for explaining ViTs called ViT-CX. It is based on the patch embeddings themselves, rather than the attention paid to them, and on their causal impacts on the model output. ViT-CX can be used to explain different ViT models. Empirical results show that, in comparison with previous methods, ViT-CX produces more meaningful saliency maps and does a better job of revealing all the important evidence for a prediction. It is also significantly more faithful to the model as measured by deletion AUC and insertion AUC.
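Since faithfulness is reported via deletion and insertion AUC, the following is a minimal sketch of the deletion-AUC metric as it is commonly formulated in the saliency-evaluation literature, not the authors' implementation: pixels are removed in order of decreasing saliency and the area under the resulting class-probability curve is measured (lower is better). The arguments `model`, `image`, `saliency`, and `target_class` are assumed caller-supplied placeholders.

```python
import torch
import torch.nn.functional as F

def deletion_auc(model, image, saliency, target_class, steps=50, baseline=0.0):
    """Deletion-AUC sketch.

    image:    (1, C, H, W) input tensor
    saliency: (H, W) tensor of per-pixel importance scores
    """
    model.eval()
    _, c, h, w = image.shape
    # Rank pixels from most to least salient.
    order = torch.argsort(saliency.flatten(), descending=True)
    pixels_per_step = max(1, order.numel() // steps)
    probs = []
    current = image.clone()
    with torch.no_grad():
        for i in range(steps + 1):
            # Record the class probability for the partially deleted image.
            p = F.softmax(model(current), dim=1)[0, target_class].item()
            probs.append(p)
            # Delete (set to the baseline value) the next batch of most-salient pixels.
            idx = order[i * pixels_per_step:(i + 1) * pixels_per_step]
            current.view(1, c, -1)[..., idx] = baseline
    # Area under the probability-vs-fraction-deleted curve (trapezoidal rule).
    return torch.trapz(torch.tensor(probs), dx=1.0 / steps).item()
```

Insertion AUC follows the same recipe in reverse: starting from a blurred or blank image, the most-salient pixels are restored first, and a higher area under the curve indicates a more faithful saliency map.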