Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding. Existing efforts to create a general vision model are limited in the scope of assessed tasks and offer no overarching framework to perform them holistically. We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spectrum of visual cognitive abilities with four functional domains: Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, ranging from 3D reconstruction to visual reasoning and manipulation. Along with the benchmark, we provide a general encoder-decoder framework that allows arbitrary visual representations to be evaluated on all 11 tasks. We evaluate various pre-trained visual representations with our framework and observe that (1) Transformer-based visual backbones generally outperform CNN-based backbones on G-VUE, and (2) visual representations from vision-language pre-training are superior to those from vision-only pre-training across visual tasks. With G-VUE, we provide a holistic evaluation standard to motivate research toward building general-purpose visual systems by obtaining more general-purpose visual representations.
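The abstract's encoder-decoder evaluation protocol (a single shared visual encoder with one lightweight, task-specific decoder per benchmark task) can be sketched as follows. This is a minimal illustrative harness, not G-VUE's actual code: the class and method names are hypothetical, and the toy "encoder" and "decoders" stand in for a frozen pre-trained backbone and learned task heads.

```python
# Hypothetical sketch of a G-VUE-style evaluation harness: one shared
# (frozen) visual encoder, with a separate lightweight decoder per task.
from typing import Callable, Dict, List


class EncoderDecoderEval:
    def __init__(self, encoder: Callable[[List[float]], List[float]]):
        self.encoder = encoder                    # pre-trained visual backbone
        self.decoders: Dict[str, Callable] = {}   # task name -> decoder head

    def register_task(self, name: str, decoder: Callable) -> None:
        # Each of the 11 tasks would register its own head here
        # (e.g. depth estimation, VQA, manipulation policy).
        self.decoders[name] = decoder

    def evaluate(self, task: str, image: List[float]):
        feats = self.encoder(image)               # shared representation
        return self.decoders[task](feats)         # task-specific prediction


# Toy usage: the "encoder" maps an image to a two-number feature vector,
# and two stand-in decoders read different predictions off those features.
harness = EncoderDecoderEval(encoder=lambda img: [sum(img), len(img)])
harness.register_task("depth", lambda f: f[0] / f[1])   # mean intensity
harness.register_task("classify", lambda f: f[0] > 0)   # sign of the sum

print(harness.evaluate("depth", [1, 2, 3]))     # 2.0
print(harness.evaluate("classify", [1, 2, 3]))  # True
```

The point of the design is that the encoder is held fixed across all tasks, so any difference in downstream scores can be attributed to the quality of the visual representation rather than to task-specific tuning of the backbone.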