Visual question answering is the task of answering questions about images. We introduce the VizWiz-VQA-Grounding dataset, the first dataset that visually grounds answers to visual questions asked by people with visual impairments. We analyze our dataset and compare it with five VQA-Grounding datasets to demonstrate what makes it similar and different. We then evaluate state-of-the-art (SOTA) VQA and VQA-Grounding models and demonstrate that current SOTA algorithms often fail to identify the correct visual evidence where the answer is located. These models regularly struggle when the visual evidence occupies only a small fraction of the image, when images are of higher quality, and when visual questions require text-recognition skills. The dataset, evaluation server, and leaderboard can all be found at the following link: https://vizwiz.org/tasks-and-datasets/answer-grounding-for-vqa/.