Partial success in closing the gap between human and machine vision

From arXiv; NeurIPS 2021 Oral, camera-ready version. A preliminary version of this work was presented as an Oral at the 2020 NeurIPS workshop on "Shared Visual Representations in Human & Machine Intelligence" (arXiv:2010.08377).

A few years ago, the first CNN surpassed human performance on ImageNet. However, it soon became clear that machines lack robustness on more challenging test cases, a major obstacle towards deploying machines "in the wild" and towards obtaining better computational models of human visual perception. Here we ask: Are we making progress in closing the gap between human and machine vision? To answer this question, we tested human observers on a broad range of out-of-distribution (OOD) datasets, recording 85,120 psychophysical trials across 90 participants. We then investigated a range of promising machine learning developments that crucially deviate from standard supervised CNNs along three axes: objective function (self-supervised, adversarially trained, CLIP language-image training), architecture (e.g. vision transformers), and dataset size (ranging from 1M to 1B). Our findings are threefold. (1.) The longstanding distortion robustness gap between humans and CNNs is closing, with the best models now exceeding human feedforward performance on most of the investigated OOD datasets. (2.) There is still a substantial image-level consistency gap, meaning that humans make different errors than models. In contrast, most models systematically agree in their categorisation errors, even substantially different ones like contrastive self-supervised vs. standard supervised models. (3.) In many cases, human-to-model consistency improves when training dataset size is increased by one to three orders of magnitude. Our results give reason for cautious optimism: While there is still much room for improvement, the behavioural difference between human and machine vision is narrowing. In order to measure future progress, 17 OOD datasets with image-level human behavioural data and evaluation code are provided as a toolbox and benchmark at: https://github.com/bethgelab/model-vs-human/
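The "image-level consistency gap" in finding (2.) concerns whether two observers get the same individual images right or wrong, beyond what their accuracies alone would predict. The abstract does not name a metric, so the following is only a minimal sketch under an assumption: it uses Cohen's kappa computed over per-image correctness, one common way to quantify this kind of agreement. The function name `error_consistency` and the toy response vectors are hypothetical and are not taken from the model-vs-human toolbox.

```python
from typing import Sequence


def error_consistency(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Cohen's kappa over per-image correctness of two observers.

    correct_a[i] / correct_b[i] say whether observer A / B classified
    image i correctly. Returns a value in [-1, 1]; 0 means the overlap
    in errors is what independent observers with these accuracies would
    show, 1 means errors fall on exactly the same images.
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(correct_a)
    acc_a = sum(bool(a) for a in correct_a) / n
    acc_b = sum(bool(b) for b in correct_b) / n

    # Observed agreement: both correct or both wrong on the same image.
    c_obs = sum(bool(a) == bool(b) for a, b in zip(correct_a, correct_b)) / n

    # Agreement expected by chance from the two accuracies alone.
    c_exp = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)

    if c_exp == 1.0:
        # Both observers are always right (or always wrong): kappa is
        # undefined; returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    return (c_obs - c_exp) / (1 - c_exp)


# Toy data (made up): four images, both observers at 50% accuracy.
human = [True, True, False, False]
model_same_errors = [True, True, False, False]
model_other_errors = [True, False, True, False]

print(error_consistency(human, model_same_errors))
print(error_consistency(human, model_other_errors))
```

In this toy case both models match the human's accuracy, yet only the first one errs on the same images. That is the distinction the abstract draws between the accuracy gap and the consistency gap.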
