Visual question answering (VQA) for remote sensing scenes has great potential in intelligent human-computer interaction systems. Although VQA in computer vision has been widely researched, VQA for remote sensing data (RSVQA) is still in its infancy. Two characteristics of the RSVQA task require special consideration. 1) No object annotations are available in RSVQA datasets, which makes it difficult for models to exploit informative region representations; 2) Each image in the RSVQA task is paired with questions of clearly different difficulty levels, so directly training a model on questions in a random order may confuse it and limit performance. To address these two problems, this paper proposes a multi-level visual feature learning method that jointly extracts language-guided holistic and regional image features. In addition, a self-paced curriculum learning (SPCL)-based VQA model is developed to train the network on samples in an easy-to-hard order. More specifically, a language-guided SPCL method with a soft weighting strategy is explored in this work. The proposed model is evaluated on three public datasets, and extensive experimental results show that the proposed RSVQA framework achieves promising performance.
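The abstract does not spell out the SPCL objective; as background only, a minimal sketch of the standard self-paced learning formulation with a soft (linear) weighting regularizer, on which language-guided SPCL variants are typically built, is

\[
\min_{\mathbf{w},\, \mathbf{v} \in [0,1]^{n}} \; \sum_{i=1}^{n} v_i\, \ell_i(\mathbf{w}) \;+\; \lambda \sum_{i=1}^{n} \Bigl( \tfrac{1}{2} v_i^{2} - v_i \Bigr),
\]

where \(\ell_i(\mathbf{w})\) is the loss of sample \(i\) under model parameters \(\mathbf{w}\), \(v_i\) is its sample weight, and \(\lambda\) controls the learning pace. For fixed \(\mathbf{w}\), the optimal soft weights have the closed form \(v_i^{*} = \max(0,\, 1 - \ell_i/\lambda)\): easy samples (small loss) receive weights near 1, hard samples are down-weighted or excluded, and gradually increasing \(\lambda\) admits harder samples, realizing the easy-to-hard training schedule; the paper's exact language-guided weighting may differ from this generic form.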