Human-like attention as a supervisory signal to guide neural attention has shown significant promise but is currently limited to unimodal integration, even for inherently multimodal tasks such as visual question answering (VQA). We present the Multimodal Human-like Attention Network (MULAN), the first method for multimodal integration of human-like attention on both image and text during training of VQA models. MULAN integrates attention predictions from two state-of-the-art text and image saliency models into the neural self-attention layers of a recent transformer-based VQA model. Through evaluations on the challenging VQAv2 dataset, we show that MULAN achieves a new state-of-the-art accuracy of 73.98% on test-std and 73.72% on test-dev while, at the same time, having approximately 80% fewer trainable parameters than prior work. Overall, our work underlines the potential of integrating multimodal human-like and neural attention for VQA.
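For intuition, the following minimal sketch illustrates one way predicted human-like saliency could be injected into a transformer self-attention layer, here as an additive log-saliency bias on the attention logits. The class name, the learnable gate, and the additive fusion rule are illustrative assumptions for exposition, not necessarily MULAN's exact mechanism.

```python
import torch
import torch.nn as nn


class SaliencyBiasedSelfAttention(nn.Module):
    """Hypothetical sketch: multi-head self-attention whose logits are biased
    by an externally predicted human-like saliency distribution over tokens
    (e.g., from a text or image saliency model). The additive log-saliency
    bias is an illustrative assumption, not the paper's verified design."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Learnable scalar controlling how strongly human attention is injected.
        self.gate = nn.Parameter(torch.tensor(1.0))

    def forward(self, x: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); saliency: (batch, seq_len), non-negative, sums to 1 per example.
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # (b, h, n, n)
        # Bias every query's distribution over keys toward tokens humans attend to.
        bias = torch.log(saliency.clamp_min(1e-6))                 # (b, n)
        logits = logits + self.gate * bias[:, None, None, :]       # broadcast over heads and queries
        attn = logits.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


# Usage sketch: question tokens with a saliency prediction from a text saliency model.
layer = SaliencyBiasedSelfAttention(dim=512)
tokens = torch.randn(2, 14, 512)
saliency = torch.softmax(torch.randn(2, 14), dim=-1)
output = layer(tokens, saliency)  # (2, 14, 512)
```

Because the saliency bias is added to the logits rather than replacing them, the layer can still learn task-specific attention while remaining anchored to human-like attention, which is consistent with the supervisory-signal framing above.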