Visual question answering (VQA) is an important and challenging multimodal task in computer vision. Recently, a few efforts have been made to bring the VQA task to aerial images, motivated by its potential real-world applications in disaster monitoring, urban planning, and digital earth product generation. However, both the huge variation in the appearance, scale, and orientation of concepts in aerial images and the scarcity of well-annotated datasets restrict the development of VQA in this domain. In this paper, we introduce a new dataset, HRVQA, which provides 53,512 aerial images of 1024x1024 pixels and 1,070,240 semi-automatically generated QA pairs. To benchmark the understanding capability of VQA models on aerial images, we evaluate relevant methods on HRVQA. Moreover, we propose a novel model, GFTransformer, with gated attention modules and a mutual fusion module. The experiments show that the proposed dataset is quite challenging, especially for attribute-related questions. Our method achieves superior performance in comparison to previous state-of-the-art approaches. The dataset and the source code will be released at https://hrvqa.nl/.