Today's computer vision models achieve human- or near-human-level performance across a wide variety of vision tasks. However, their architectures, data, and learning algorithms differ in numerous ways from those that give rise to human vision. In this paper, we investigate the factors that affect alignment between the representations learned by neural networks and human concept representations. Human representations are inferred from behavioral responses in an odd-one-out triplet task, in which participants are shown three images and asked to select the one that does not belong. We find that model scale and architecture have essentially no effect on alignment with human behavioral responses, whereas the training dataset and objective function have a much larger impact. Using a sparse Bayesian model of human conceptual representations, we partition triplets by the concept that distinguishes the two similar images from the odd-one-out, finding that some concepts, such as food and animals, are well represented in neural network representations, whereas others, such as royal or sports-related objects, are not. Overall, although models trained on larger, more diverse datasets achieve better alignment with humans than models trained on ImageNet alone, our results indicate that scaling alone is unlikely to be sufficient to train neural networks whose conceptual representations match those used by humans.
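To make the behavioral alignment measure concrete, the sketch below shows one simple way a model's odd-one-out choice can be read off its image embeddings and compared against human responses. The function names, the use of cosine similarity, and the accuracy-based alignment score are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_odd_one_out(embeddings):
    """Pick the odd-one-out from three image embeddings.

    The most similar pair of images is treated as belonging together;
    the remaining image (index 0-2) is the model's odd-one-out choice.
    """
    pairs = [(0, 1, 2), (0, 2, 1), (1, 2, 0)]  # (i, j, odd)
    sims = [cosine_sim(embeddings[i], embeddings[j]) for i, j, _ in pairs]
    return pairs[int(np.argmax(sims))][2]

def odd_one_out_accuracy(triplet_embeddings, human_choices):
    """Fraction of triplets where the model's choice matches the human choice.

    triplet_embeddings: iterable of (3, d) arrays, one per triplet (hypothetical input format).
    human_choices: iterable of indices in {0, 1, 2} giving the human odd-one-out.
    """
    hits = [
        predict_odd_one_out(emb) == choice
        for emb, choice in zip(triplet_embeddings, human_choices)
    ]
    return float(np.mean(hits))
```

Under this formulation, comparing alignment across models reduces to extracting embeddings for the same triplets from each model and comparing the resulting accuracies against the human responses.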