Transformers are increasingly dominating multi-modal reasoning tasks, such as visual question answering, achieving state-of-the-art results thanks to their ability to contextualize information using the self-attention and co-attention mechanisms. These attention modules also play a role in other computer vision tasks, including object detection and image segmentation. Unlike Transformers that use only self-attention, Transformers with co-attention must consider multiple attention maps in parallel in order to highlight the information in the model's input that is relevant to the prediction. In this work, we propose the first method to explain predictions by any Transformer-based architecture, including bi-modal Transformers and Transformers with co-attention. We provide generic solutions and apply them to the three most commonly used architectures: (i) pure self-attention, (ii) self-attention combined with co-attention, and (iii) encoder-decoder attention. We show that our method is superior to all existing methods, which are adapted from single-modality explainability.
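As a rough illustration of the pure self-attention case (i), the sketch below aggregates per-layer attention maps into a single token-level relevancy map in the spirit of attention rollout, with optional gradient weighting. The function name, tensor shapes, and the gradient-weighting step are illustrative assumptions made here for clarity, not the paper's exact formulation.

import torch

def aggregate_self_attention(attn_maps, grads=None):
    """Aggregate per-layer self-attention maps into one token-level
    relevancy map (rollout-style sketch; not the paper's exact method).

    attn_maps: list of tensors of shape (heads, tokens, tokens),
               one per layer, softmax-normalized attention.
    grads:     optional list of matching gradient tensors; when given,
               each map is weighted by its positive gradient values
               before averaging over heads.
    """
    num_tokens = attn_maps[0].shape[-1]
    # Start from the identity: each token is initially relevant to itself.
    relevancy = torch.eye(num_tokens)
    for layer, attn in enumerate(attn_maps):
        if grads is not None:
            # Keep only positively contributing attention values.
            attn = (grads[layer] * attn).clamp(min=0)
        avg = attn.mean(dim=0)  # average over heads -> (tokens, tokens)
        # Residual connections mix each token with itself, so the identity
        # is added before propagating relevancy through the layer.
        relevancy = relevancy + (avg @ relevancy)
    return relevancy

# Usage with random stand-in attention maps (2 layers, 4 heads, 6 tokens):
maps = [torch.rand(4, 6, 6).softmax(dim=-1) for _ in range(2)]
print(aggregate_self_attention(maps).shape)  # torch.Size([6, 6])

The co-attention case (ii) would require propagating such relevancy maps across both modalities simultaneously, which is what distinguishes it from the single-map aggregation shown here.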