Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose to use a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. Our results on a visual question answering task that requires external knowledge (OK-VQA) show that our text-only model outperforms pretrained multimodal (image-text) models with a comparable number of parameters. In contrast, our model is less effective on a standard VQA task (VQA 2.0), confirming that our text-only method is especially effective for tasks requiring external knowledge. In addition, we show that our unimodal model is complementary to multimodal models on both OK-VQA and VQA 2.0, yielding the best result to date on OK-VQA among systems that do not use external knowledge graphs, and results comparable to systems that do use them. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be compensated for by the better inference ability of the text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks.
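To make the proposed pipeline concrete, the sketch below illustrates the text-only procedure described above: an image is first replaced by an automatically generated caption, and the caption plus the question are then fed to a pretrained language model. This is a minimal illustration, not the paper's exact setup; the Hugging Face Transformers pipelines, the specific captioning and language models, and the prompt format are illustrative assumptions.

```python
# Minimal sketch of a caption-then-answer pipeline, assuming Hugging Face
# Transformers. Model choices and the prompt are illustrative, not the
# components used in the paper.
from transformers import pipeline

# Off-the-shelf image captioner (hypothetical choice).
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

# Pretrained text-only language model used to answer (hypothetical choice).
answerer = pipeline("text2text-generation", model="google/flan-t5-base")

def answer_question(image_path: str, question: str) -> str:
    # 1) Replace the image with an automatic caption.
    caption = captioner(image_path)[0]["generated_text"]
    # 2) Feed caption + question to the language model, which can draw on
    #    the world knowledge acquired during pretraining.
    prompt = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    return answerer(prompt, max_new_tokens=10)[0]["generated_text"]

# Example usage (image path and question are placeholders):
# print(answer_question("kitchen.jpg", "What appliance keeps food cold?"))
```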