Answering visual questions requires acquiring everyday commonsense knowledge and modeling the semantic connections among different parts of an image, which is difficult for VQA systems to learn from images with answers as the only supervision. Meanwhile, image captioning systems with a beam search strategy tend to generate similar captions and fail to describe images diversely. To address these issues, we present a system in which the two tasks complement each other, jointly producing image captions and answering visual questions. In particular, we utilize question and image features to generate question-related captions, and use the generated captions as additional features to provide new knowledge to the VQA system. For image captioning, our system attains more informative results in terms of relative improvements on VQA tasks, as well as competitive results on automated metrics. Applying our system to VQA tasks, our results on the VQA v2 dataset reach 65.8% using generated captions and 69.1% using annotated captions on the validation set, and 68.4% on the test-standard set. Further, an ensemble of 10 models reaches 69.7% on the test-standard split.
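To make the described pipeline concrete, the sketch below illustrates the general idea of conditioning a caption decoder on fused image and question features and re-encoding the generated caption as an extra input to the answer classifier. This is a minimal illustration under our own assumptions, not the paper's actual implementation: all module names, feature dimensions, the greedy decoding, and the simple multiplicative fusion are hypothetical choices for exposition.

```python
import torch
import torch.nn as nn

class CaptionAidedVQA(nn.Module):
    """Sketch: generate a question-relevant caption, then reuse its
    encoding as an additional feature for answer prediction."""

    def __init__(self, vocab_size=10000, num_answers=3129, d=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d)
        self.img_proj = nn.Linear(2048, d)   # e.g. pre-extracted region features
        self.q_enc = nn.GRU(d, d, batch_first=True)
        self.cap_dec = nn.GRUCell(d, d)      # caption generator
        self.word_out = nn.Linear(d, vocab_size)
        self.cap_enc = nn.GRU(d, d, batch_first=True)
        self.classifier = nn.Linear(3 * d, num_answers)

    def forward(self, img_feats, q_tokens, cap_len=16):
        v = self.img_proj(img_feats).mean(1)            # (B, d) visual feature
        _, hq = self.q_enc(self.word_emb(q_tokens))
        q = hq.squeeze(0)                               # (B, d) question feature
        h = v * q                                       # fuse image + question
        tok = torch.zeros(v.size(0), dtype=torch.long)  # assume <bos> id = 0
        cap_embs = []
        for _ in range(cap_len):                        # greedy decoding
            h = self.cap_dec(self.word_emb(tok), h)
            tok = self.word_out(h).argmax(-1)           # hard choice: no gradient
            cap_embs.append(self.word_emb(tok))
        # Re-encode the generated caption as new knowledge for the VQA head.
        _, hc = self.cap_enc(torch.stack(cap_embs, dim=1))
        c = hc.squeeze(0)
        return self.classifier(torch.cat([v, q, c], dim=-1))  # answer logits

model = CaptionAidedVQA()
img = torch.randn(2, 36, 2048)        # 36 region features per image
q = torch.randint(0, 10000, (2, 14))  # tokenized questions
logits = model(img, q)                # (2, 3129) answer scores
```

At training time, the caption branch would additionally be supervised with (question-relevant) caption annotations, which is what lets the two tasks complement each other rather than the classifier simply ignoring the caption feature.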