Today's computer vision models achieve human- or near-human-level performance across a wide variety of vision tasks. However, their architectures, data, and learning algorithms differ in numerous ways from those that give rise to human vision. In this paper, we investigate the factors that affect the alignment between the representations learned by neural networks and human mental representations inferred from behavioral responses. We find that model scale and architecture have essentially no effect on the alignment with human behavioral responses, whereas the training dataset and objective function both have a much larger impact. These findings are consistent across three datasets of human similarity judgments collected using two different tasks. Linear transformations of neural network representations learned from behavioral responses on one dataset substantially improve alignment with human similarity judgments on the other two datasets. In addition, we find that some human concepts such as food and animals are well-represented by neural networks, whereas others such as royal or sports-related objects are not. Overall, although models trained on larger, more diverse datasets achieve better alignment with humans than models trained on ImageNet alone, our results indicate that scaling alone is unlikely to be sufficient to train neural networks with conceptual representations that match those used by humans.