Breakthroughs in transformer-based models have revolutionized not only NLP but also vision and multimodal systems. However, although visualization and interpretability tools have become available for NLP models, the internal mechanisms of vision and multimodal transformers remain largely opaque. With the success of these transformers, it is increasingly critical to understand their inner workings, as unraveling these black boxes will lead to more capable and trustworthy models. To contribute to this quest, we propose VL-InterpreT, which provides novel interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers. VL-InterpreT is a task-agnostic, integrated tool that (1) tracks a variety of statistics in attention heads throughout all layers for both vision and language components, (2) visualizes cross-modal and intra-modal attentions through easily readable heatmaps, and (3) plots the hidden representations of vision and language tokens as they pass through the transformer layers. In this paper, we demonstrate the functionalities of VL-InterpreT through the analysis of KD-VLP, an end-to-end pretrained vision-language transformer, on Visual Commonsense Reasoning (VCR) and WebQA, two visual question answering benchmarks. We also present a few interesting findings about multimodal transformer behaviors that were learned through our tool.
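As a rough illustration of the kind of per-head attention statistic mentioned in (1), the sketch below computes, for every layer and head, the average attention mass that language tokens send to vision tokens. It is a minimal sketch under stated assumptions and not VL-InterpreT's actual API: the function names `cross_modal_mass` and `head_summary`, the nested-list attention layout `[layer][head][query][key]`, and the split of token positions into text and image indices are all hypothetical.

```python
from typing import List, Sequence

# Assumed layout: one head is a square matrix, rows = query tokens, cols = key tokens.
Matrix = Sequence[Sequence[float]]


def cross_modal_mass(attn: Matrix, query_idx: Sequence[int], key_idx: Sequence[int]) -> float:
    """Average, over the chosen query tokens, of the attention mass sent to the chosen key tokens."""
    if not query_idx:
        raise ValueError("query_idx must not be empty")
    total = sum(attn[q][k] for q in query_idx for k in key_idx)
    return total / len(query_idx)


def head_summary(
    layers: Sequence[Sequence[Matrix]],
    text_idx: Sequence[int],
    image_idx: Sequence[int],
) -> List[List[float]]:
    """Return a [layer][head] grid of mean language-to-vision attention."""
    return [
        [cross_modal_mass(head, text_idx, image_idx) for head in layer]
        for layer in layers
    ]


# Toy input (hypothetical): 1 layer, 2 heads, 4 tokens.
# Assumed token layout: positions 0 and 1 are text, positions 2 and 3 are image regions.
toy = [[
    [[0.70, 0.10, 0.10, 0.10],
     [0.20, 0.60, 0.10, 0.10],
     [0.25, 0.25, 0.25, 0.25],
     [0.25, 0.25, 0.25, 0.25]],
    [[0.10, 0.10, 0.40, 0.40],
     [0.00, 0.20, 0.50, 0.30],
     [0.25, 0.25, 0.25, 0.25],
     [0.25, 0.25, 0.25, 0.25]],
]]

summary = head_summary(toy, text_idx=[0, 1], image_idx=[2, 3])
```

In this toy input the second head sends more of its text-token attention to the image positions than the first, which is the sort of contrast a per-head summary grid is meant to surface. Swapping `text_idx` and `image_idx` gives the vision-to-language direction, and passing the same index list twice gives an intra-modal statistic.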