Modern machine learning models for computer vision exceed humans in accuracy on specific visual recognition tasks, notably on datasets such as ImageNet. However, high accuracy can be achieved in many ways. The particular decision function found by a machine learning system is determined not only by the data to which the system is exposed, but also by the inductive biases of the model, which are typically harder to characterize. In this work, we follow a recent trend of in-depth behavioral analyses of neural network models that go beyond accuracy as an evaluation metric by examining patterns of errors. Our focus is on comparing a suite of standard Convolutional Neural Networks (CNNs) with a recently proposed attention-based network, the Vision Transformer (ViT), which relaxes the translation-invariance constraint of CNNs and therefore represents a model with a weaker set of inductive biases. Attention-based networks have previously been shown to achieve higher accuracy than CNNs on vision tasks, and we demonstrate, using new metrics that examine error consistency at a finer granularity, that their errors are also more consistent with those of humans. These results have implications both for building more human-like vision models and for understanding visual object recognition in humans.
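For concreteness, the baseline notion of error consistency referenced here can be sketched as follows. This is a minimal implementation of the standard trial-level measure (Cohen's kappa computed over correctness, in the style of Geirhos et al., 2020), not the finer-grained metrics introduced in this work; the function name and array-based interface are our own illustration.

```python
import numpy as np

def error_consistency(correct_a, correct_b):
    """Trial-level error consistency (Cohen's kappa over correctness),
    following the standard formulation of Geirhos et al. (2020).

    correct_a, correct_b: boolean arrays with one entry per trial,
    marking whether each observer (e.g. a model and a human subject)
    classified that trial correctly.
    """
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)
    # Observed consistency: fraction of trials on which both observers
    # are right together or wrong together.
    c_obs = np.mean(correct_a == correct_b)
    # Consistency expected by chance for two independent observers
    # with these marginal accuracies.
    p_a, p_b = correct_a.mean(), correct_b.mean()
    c_exp = p_a * p_b + (1.0 - p_a) * (1.0 - p_b)
    # Kappa > 0: the observers err on the same trials more often than
    # their accuracies alone would predict. (Undefined when c_exp = 1,
    # i.e. when both observers are perfectly accurate or always wrong.)
    return (c_obs - c_exp) / (1.0 - c_exp)
```

A kappa near zero indicates that two observers' errors co-occur only as often as chance predicts given their individual accuracies; a positive kappa indicates genuinely shared error patterns, which is the kind of agreement the metrics in this work are designed to probe in more detail.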