Transformers have become a default architecture in computer vision, but understanding what drives their predictions remains a challenging problem. Current explanation approaches rely on attention values or input gradients, but these provide a limited understanding of a model's dependencies. Shapley values offer a theoretically sound alternative, but their computational cost makes them impractical for large, high-dimensional models. In this work, we aim to make Shapley values practical for vision transformers (ViTs). To do so, we first leverage an attention masking approach to evaluate ViTs with partial information, and we then develop a procedure for generating Shapley value explanations via a separate, learned explainer model. Our experiments compare Shapley values to many baseline methods (e.g., attention rollout, GradCAM, LRP), and we find that our approach provides more accurate explanations than existing methods for ViTs.
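To make the attention masking idea concrete, below is a minimal sketch (not the paper's implementation) of evaluating a ViT with partial information: masked patch tokens are simply dropped before the transformer blocks, so the remaining tokens attend only to each other. It assumes a timm-style VisionTransformer exposing patch_embed, cls_token, pos_embed, blocks, norm, and head; attribute names may differ across timm versions.

```python
import torch
import timm

@torch.no_grad()
def masked_vit_logits(vit, x, keep):
    """Evaluate a ViT on a subset of patches.

    x: (1, 3, H, W) image tensor.
    keep: boolean tensor of shape (num_patches,) marking retained patches.
    """
    tokens = vit.patch_embed(x)                        # (1, num_patches, dim)
    cls = vit.cls_token.expand(x.shape[0], -1, -1)     # (1, 1, dim)
    tokens = torch.cat([cls, tokens], dim=1) + vit.pos_embed
    # Keep the class token plus the selected patch tokens only, so attention
    # is restricted to the unmasked subset (one way to realize attention masking).
    idx = torch.cat([torch.tensor([True]), keep])
    tokens = tokens[:, idx]
    for blk in vit.blocks:
        tokens = blk(tokens)
    tokens = vit.norm(tokens)
    return vit.head(tokens[:, 0])                      # logits from the class token

# Usage: random 50% of patches retained (hypothetical example).
vit = timm.create_model("vit_base_patch16_224", pretrained=False).eval()
x = torch.randn(1, 3, 224, 224)
keep = torch.rand(vit.patch_embed.num_patches) > 0.5
print(masked_vit_logits(vit, x, keep).shape)           # torch.Size([1, 1000])
```

A routine like this supplies the value function over patch subsets that Shapley values require; the paper then amortizes the expensive Shapley computation with a separate, learned explainer model rather than estimating it per image.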