Computer vision models excel at making predictions when the test distribution closely resembles the training distribution. Such models have yet to match the ability of biological vision to learn from multiple sources and generalize to new data sources and tasks. To facilitate the development and evaluation of more general vision systems, we introduce the General Robust Image Task (GRIT) benchmark. GRIT evaluates the performance, robustness, and calibration of a vision system across a variety of image prediction tasks, concepts, and data sources. The seven tasks in GRIT are selected to cover a range of visual skills: object categorization, object localization, referring expression grounding, visual question answering, segmentation, human keypoint detection, and surface normal estimation. GRIT is carefully designed to enable the evaluation of robustness under image perturbations, image source distribution shift, and concept distribution shift. By providing a unified platform for thorough assessment of skills and concepts learned by a vision model, we hope GRIT catalyzes the development of performant and robust general-purpose vision systems.