Visual question answering (VQA) is of significant interest due to its potential to serve as a strong test of image understanding systems and to probe the connection between language and vision. Despite much recent progress, general VQA remains far from solved. In this paper, we focus on the multiple-choice VQA task and present good practices for designing an effective VQA model that captures language-vision interactions and performs joint reasoning. We explore mechanisms for incorporating part-of-speech (POS) tag guided attention, convolutional n-grams, triplet attention interactions among the image, question, and candidate answer, and structured learning for triplets based on image-question pairs. We evaluate our models on two popular datasets: Visual7W and VQA Real Multiple Choice. Our final model achieves state-of-the-art performance of 68.2% on Visual7W and a very competitive 69.6% on the test-standard split of VQA Real Multiple Choice.