Visual question answering is an important task in both natural language and vision understanding. However, in most public visual question answering datasets such as VQA and CLEVR, the questions are human-generated and specific to the given image, such as `What color are her eyes?'. These crowdsourced questions are relatively simple and sometimes biased toward certain entities or attributes. In this paper, we introduce ChiQA, a new image-based question answering dataset. It contains real-world queries issued by internet users, each paired with several related open-domain images. The system must determine whether an image can answer the question or not. Unlike previous VQA datasets, the questions are real-world, image-independent queries that are more diverse and less biased. Compared with previous image-retrieval or image-captioning datasets, ChiQA measures not only relatedness but also answerability, which demands more fine-grained vision and language reasoning. ChiQA contains more than 40K questions and more than 200K question-image pairs. A three-level 2/1/0 label is assigned to each pair, indicating whether the image is a perfect answer, a partial answer, or irrelevant. Data analysis shows that ChiQA requires a deep understanding of both language and vision, including grounding, comparison, and reading. We evaluate several state-of-the-art visual-language models such as ALBEF, demonstrating that there is still substantial room for improvement on ChiQA.
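To make the dataset format concrete, the sketch below shows how a ChiQA-style question-image pair with its 2/1/0 answerability label might be represented. The field names, class name, and example URLs are illustrative assumptions, not the released data schema.

```python
# Minimal sketch of a ChiQA-style record; field names and URLs are
# hypothetical placeholders, not the official dataset schema.
from dataclasses import dataclass
from typing import List

# Three-level answerability labels described in the abstract:
# 2 = the image perfectly answers the question,
# 1 = the image partially answers it,
# 0 = the image is irrelevant / cannot answer it.
PERFECT, PARTIAL, IRRELEVANT = 2, 1, 0

@dataclass
class ChiQAPair:
    question: str    # real-world query issued by an internet user
    image_url: str   # one of the related open-domain images
    label: int       # 2 / 1 / 0 answerability label

# Hypothetical examples illustrating how answerability differs from mere
# relatedness: both images relate to the query, but only one answers it.
pairs: List[ChiQAPair] = [
    ChiQAPair("what does a luna moth look like",
              "https://example.com/luna_moth_photo.jpg", PERFECT),
    ChiQAPair("what does a luna moth look like",
              "https://example.com/moth_habitat_map.jpg", IRRELEVANT),
]
```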