Visual commonsense understanding requires Vision-Language (VL) models not only to understand images and text but also to cross-reference between them to fully integrate and comprehend the described visual scene. Recently, various approaches have been developed and have achieved high performance on visual commonsense benchmarks. However, due to limited evaluation data resources, it remains unclear whether these models truly understand the visual scene and the underlying commonsense knowledge. To provide an in-depth analysis, we present a Multimodal Evaluation (ME) pipeline that automatically generates question-answer pairs to test models' understanding of the visual scene, the text, and related knowledge. We then take a step further to show that training with the ME data boosts the model's performance on the standard VCR evaluation. Lastly, our in-depth analysis and comparison reveal interesting findings: (1) semantically low-level information can assist the learning of high-level information, but not vice versa; (2) visual information is generally underutilized compared with text.