Gestalt psychologists have identified a range of conditions in which humans organize elements of a scene into a group or whole, and perceptual grouping principles play an essential role in scene perception and object identification. Recently, Deep Neural Networks (DNNs) trained on natural images (ImageNet) have been proposed as compelling models of human vision, based on reports that they perform well on various brain and behavioral benchmarks. Here we test a total of 16 networks covering a variety of architectures and learning paradigms (convolutional, attention-based, supervised and self-supervised, feed-forward and recurrent) on dot stimuli (Experiment 1) and more complex shape stimuli (Experiment 2) that produce strong Gestalt effects in humans. In Experiment 1 we found that convolutional networks were indeed sensitive, in a human-like fashion, to the principles of proximity, linearity, and orientation, but only at the output layer. In Experiment 2, we found that most networks exhibited Gestalt effects only for a few stimulus sets, and again only at the latest stage of processing. Overall, self-supervised networks and Vision Transformers appeared to perform worse than convolutional networks in terms of human similarity. Remarkably, no model presented a grouping effect at the early or intermediate stages of processing. This is at odds with the widespread assumption that Gestalts occur prior to object recognition and, indeed, serve to organize the visual scene for the sake of object recognition. Our overall conclusion is that, although it is noteworthy that networks trained on simple 2D images support a form of Gestalt grouping for some stimuli at the output layer, this ability does not seem to transfer to more complex features. Additionally, the fact that this grouping occurs only at the last layer suggests that networks learn fundamentally different perceptual properties than humans.