Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resulting data is of high quality. VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits and achieve a level of robustness that is lacking in the same models trained on human-annotated VQA data.