Methodologies for training visual question answering (VQA) models assume the availability of datasets with human-annotated \textit{Image-Question-Answer} (I-Q-A) triplets. This has led to a heavy reliance on such datasets and a lack of generalization to new types of questions and scenes. Linguistic priors, along with biases and errors arising from annotator subjectivity, have been shown to percolate into VQA models trained on such samples. We study whether models can be trained without any human-annotated Q-A pairs, using only images and their associated textual descriptions or captions. We present a method to train models with synthetic Q-A pairs generated procedurally from captions. Additionally, we demonstrate that spatial-pyramid image patches are a simple but effective alternative to the dense and costly object bounding-box annotations used in existing VQA models. Our experiments on three VQA benchmarks show the efficacy of this weakly-supervised approach, especially on the VQA-CP challenge, which tests performance under changing linguistic priors.
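To make the idea of procedurally generating Q-A pairs from captions concrete, below is a minimal, hypothetical sketch using a couple of hand-written templates over a caption string. The function name, regex rules, and templates are illustrative assumptions and not the paper's actual generation procedure, which may use richer linguistic parsing and many more question types.

\begin{verbatim}
# Hypothetical sketch: converting a caption into synthetic Q-A pairs
# with simple hand-written templates. The rules below are illustrative
# only and do not reproduce the paper's actual generation procedure.
import re
from typing import List, Tuple

# Toy rule: match "<subject> <verb>ing a/an/the <object>" in a caption.
_SUBJ_VERB_OBJ = re.compile(
    r"^(?:a|an|the)?\s*(?P<subj>\w+)\s+(?P<verb>\w+ing)\s+"
    r"(?:a|an|the)\s+(?P<obj>\w+)",
    re.IGNORECASE,
)

def caption_to_qa(caption: str) -> List[Tuple[str, str]]:
    """Generate (question, answer) pairs from a single caption."""
    qa_pairs = []
    caption = caption.strip().rstrip(".")

    m = _SUBJ_VERB_OBJ.search(caption)
    if m:
        subj, verb, obj = m.group("subj"), m.group("verb"), m.group("obj")
        # "A man riding a horse" -> "What is the man riding?" / "horse"
        qa_pairs.append((f"What is the {subj} {verb}?", obj))
        # Activity question: -> "What is the man doing?" / "riding"
        qa_pairs.append((f"What is the {subj} doing?", verb))
        # Yes/no existence questions for each mentioned entity.
        for entity in (subj, obj):
            qa_pairs.append((f"Is there a {entity} in the image?", "yes"))
    return qa_pairs

if __name__ == "__main__":
    for q, a in caption_to_qa("A man riding a horse on the beach."):
        print(f"{q} -> {a}")
\end{verbatim}

A similar template-driven idea extends to counting and attribute questions; the key point is that no human annotation beyond the caption itself is required to produce the training triplets.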