Large language models have demonstrated an emergent capability in answering knowledge-intensive questions. With recent progress on web-scale visual and language pre-training, do these models also understand how to answer visual information-seeking questions? To answer this question, we present InfoSeek, a Visual Question Answering dataset tailored to information-seeking questions that cannot be answered with common sense knowledge alone. We perform a multi-stage human annotation to collect a natural distribution of high-quality visual information-seeking question-answer pairs. We also construct a large-scale, automatically collected dataset by combining existing visual entity recognition datasets with Wikidata, which provides over one million examples for model fine-tuning and validation. Based on InfoSeek, we analyze various pre-trained Visual QA systems to gain insights into the characteristics of different pre-trained models. Our analysis shows that it is challenging for state-of-the-art multi-modal pre-trained models to answer visual information-seeking questions, but this capability is improved through fine-tuning on the automated InfoSeek dataset. We hope our analysis paves the way to understand and develop the next generation of multi-modal pre-training.