The Visual Question Answering (VQA) task aspires to provide a meaningful testbed for the development of AI models that can jointly reason over visual and natural language inputs. Despite a proliferation of VQA datasets, this goal is hindered by a set of common limitations. These include a reliance on relatively simplistic questions that are repetitive in both concepts and linguistic structure, little world knowledge needed outside of the paired image, and limited reasoning required to arrive at the correct answer. We introduce A-OKVQA, a crowdsourced dataset composed of a diverse set of about 25K questions requiring a broad base of commonsense and world knowledge to answer. In contrast to the existing knowledge-based VQA datasets, the questions generally cannot be answered by simply querying a knowledge base, and instead require some form of commonsense reasoning about the scene depicted in the image. We demonstrate the potential of this new dataset through a detailed analysis of its contents and baseline performance measurements over a variety of state-of-the-art vision-language models. Project page: http://a-okvqa.allenai.org/