Alternately reasoning over visual facts and commonsense knowledge is fundamental for an advanced VQA system. This ability requires models to go beyond a literal understanding of commonsense: the system should not merely treat objects as entry points for querying background knowledge, but should fully ground commonsense in the visual world and imagine possible relationships between objects, e.g., "fork, can lift, food". To comprehensively evaluate such abilities, we propose a VQA benchmark, CRIC, which introduces new types of questions about Compositional Reasoning on vIsion and Commonsense, together with an evaluation metric that integrates answer correctness with commonsense grounding. To collect such questions, along with the rich additional annotations needed to support this metric, we also propose an automatic algorithm that generates question samples from the scene graph associated with each image and the relevant knowledge graph. We further analyze several representative types of VQA models on the CRIC dataset. Experimental results show that grounding commonsense to image regions and jointly reasoning over vision and commonsense remain challenging for current approaches. The dataset is available at https://cricvqa.github.io.
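To make the grounding-integrated metric concrete, the following is a minimal sketch of one plausible formulation, assuming IoU-based region matching with a fixed threshold; the function names and the exact scoring rule are illustrative assumptions, not the paper's actual metric.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x, y, w, h] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def grounded_accuracy(pred_answer, gold_answer,
                      pred_region, gold_region, iou_threshold=0.5):
    """Hypothetical joint score: a prediction counts as correct only if
    the answer matches AND the predicted supporting region sufficiently
    overlaps the annotated grounding region."""
    answer_ok = pred_answer == gold_answer
    grounding_ok = iou(pred_region, gold_region) >= iou_threshold
    return float(answer_ok and grounding_ok)
```

Averaging this score over a test set penalizes models that answer correctly for the wrong visual reason, which is the failure mode the metric is designed to expose.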
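As a rough illustration of how such automatic generation could proceed, the sketch below pairs a grounded scene-graph object with a matching commonsense triple to instantiate a question template. The data layout, template, and function name are hypothetical stand-ins, not the paper's actual pipeline.

```python
# Hypothetical inputs: a scene graph of detected objects with regions
# ([x, y, w, h] boxes) and a knowledge graph of commonsense triples.
scene_graph = [
    {"object": "fork", "region": [120, 80, 40, 90]},
    {"object": "table", "region": [0, 150, 400, 200]},
]
knowledge_graph = [
    ("fork", "can lift", "food"),
    ("fork", "is made of", "metal"),
]

def generate_questions(scene_graph, knowledge_graph):
    """Pair each grounded object with commonsense triples whose head it
    matches, then fill a question template; the region is retained so
    commonsense grounding can be evaluated alongside the answer."""
    objects = {node["object"]: node["region"] for node in scene_graph}
    samples = []
    for head, relation, tail in knowledge_graph:
        if head in objects:
            samples.append({
                "question": f"Which object in the image {relation} {tail}?",
                "answer": head,
                "grounding": objects[head],          # supporting image region
                "supporting_fact": (head, relation, tail),
            })
    return samples

for sample in generate_questions(scene_graph, knowledge_graph):
    print(sample["question"], "->", sample["answer"])
```

Because every generated sample carries its supporting fact and image region, the same procedure yields both the question-answer pairs and the grounding annotations that the joint metric requires.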