Using large pre-trained models for image recognition tasks is becoming increasingly common, owing to the well-acknowledged success of recent models such as vision transformers and CNN-based models like VGG and ResNet. The high accuracy of these models on benchmark tasks has translated into their practical use across many domains, including safety-critical applications like autonomous driving and medical diagnostics. Despite their widespread use, image models have been shown to be fragile to changes in the operating environment, bringing their robustness into question. There is an urgent need for methods that systematically characterise and quantify the capabilities of these models to help designers understand and provide guarantees about their safety and robustness. In this paper, we propose Vision Checklist, a framework aimed at interrogating the capabilities of a model in order to produce a report that can be used by a system designer for robustness evaluations. The framework defines a set of perturbation operations that can be applied to the underlying data to generate test samples of different types. The perturbations reflect potential changes in operating environments and probe properties ranging from the strictly quantitative to the more qualitative. We evaluate our framework on multiple datasets, including TinyImageNet, CIFAR10, CIFAR100, and Camelyon17, and on models such as ViT and ResNet. Vision Checklist proposes a specific set of evaluations that can be integrated into the previously proposed concept of a model card. Robustness evaluations like our checklist will be crucial in future safety evaluations of visual perception modules, and will be useful to a wide range of stakeholders, including designers, deployers, and regulators involved in the certification of these systems. The source code of Vision Checklist will be made publicly available.
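To make the idea of perturbation-based test generation concrete, below is a minimal sketch (not code from the paper) of the kind of perturbation operations such a checklist might apply. The operation names, parameters, and structure are illustrative assumptions; only standard PIL calls are used.

```python
from PIL import Image, ImageFilter

# Hypothetical perturbation operations of the kind a vision checklist
# might apply; names and default parameters are illustrative assumptions.

def rotate(img: Image.Image, degrees: float = 15.0) -> Image.Image:
    """Simulate a change in camera orientation."""
    return img.rotate(degrees, expand=True)

def blur(img: Image.Image, radius: float = 2.0) -> Image.Image:
    """Simulate defocus or sensor degradation in the environment."""
    return img.filter(ImageFilter.GaussianBlur(radius))

def occlude(img: Image.Image, frac: float = 0.25) -> Image.Image:
    """Black out a corner patch to simulate a partially hidden object."""
    out = img.convert("RGB").copy()
    w, h = out.size
    out.paste((0, 0, 0), (0, 0, int(w * frac), int(h * frac)))
    return out

def generate_test_samples(img: Image.Image) -> dict:
    """Apply each perturbation to produce a suite of test inputs
    whose predictions can be compared against the original."""
    return {
        "original": img,
        "rotated": rotate(img),
        "blurred": blur(img),
        "occluded": occlude(img),
    }
```

A checklist report would then, for example, run a model over the returned suite and record where predictions diverge from the prediction on the unperturbed image.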