Most existing research on visual question answering (VQA) is limited to information explicitly present in an image or a video. In this paper, we take visual understanding to a higher level, challenging systems to answer questions that require mentally simulating the hypothetical consequences of performing specific actions in a given scenario. Towards that end, we formulate a vision-language question answering task based on the CLEVR (Johnson et al., 2017) dataset. We then modify the best existing VQA methods and propose baseline solvers for this task. Finally, we motivate the development of better vision-language models by providing insights into the capability of diverse architectures to perform joint reasoning over image and text modalities. Our dataset setup scripts and code will be made publicly available at https://github.com/shailaja183/clevr_hyp.