Convolutional Neural Networks (CNNs) are commonly thought to recognise objects by learning increasingly complex representations of object shapes. Some recent studies suggest a more important role of image textures. We here put these conflicting hypotheses to a quantitative test by evaluating CNNs and human observers on images with a texture-shape cue conflict. We show that ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies. We then demonstrate that the same standard architecture (ResNet-50) that learns a texture-based representation on ImageNet is able to learn a shape-based representation instead when trained on "Stylized-ImageNet", a stylized version of ImageNet. This provides a much better fit for human behavioural performance in our well-controlled psychophysical lab setting (nine experiments totalling 48,560 psychophysical trials across 97 observers) and comes with a number of unexpected emergent benefits such as improved object detection performance and previously unseen robustness towards a wide range of image distortions, highlighting advantages of a shape-based representation.
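The abstract does not spell out how a texture-versus-shape bias would be scored on cue-conflict images. The sketch below shows one plausible way to do it, under stated assumptions: each trial carries one label for the shape cue and a different label for the texture cue, and the "shape bias" is the fraction of shape decisions among all decisions that match either cue. The names `Trial` and `shape_bias` are hypothetical and do not come from the source.

```python
from typing import Iterable, NamedTuple


class Trial(NamedTuple):
    """One cue-conflict trial (hypothetical structure, for illustration only)."""
    shape_label: str    # category suggested by the object's shape
    texture_label: str  # category suggested by the object's texture
    predicted: str      # category chosen by the model or the human observer


def shape_bias(trials: Iterable[Trial]) -> float:
    """Fraction of shape decisions among decisions matching either cue.

    Assumption: trials whose prediction matches neither cue are ignored,
    and trials where both cues agree are skipped because they carry no conflict.
    """
    shape_hits = 0
    texture_hits = 0
    for t in trials:
        if t.shape_label == t.texture_label:
            continue  # not a cue-conflict image
        if t.predicted == t.shape_label:
            shape_hits += 1
        elif t.predicted == t.texture_label:
            texture_hits += 1
    decided = shape_hits + texture_hits
    if decided == 0:
        raise ValueError("no trial matched either the shape or the texture cue")
    return shape_hits / decided


# Made-up example data, not results from the study.
example_trials = [
    Trial("cat", "elephant", "elephant"),  # texture decision
    Trial("car", "clock", "car"),          # shape decision
    Trial("bear", "bottle", "bottle"),     # texture decision
    Trial("dog", "keyboard", "chair"),     # matches neither cue, ignored
]
example_bias = shape_bias(example_trials)
```

A value near 1 would indicate mostly shape-based decisions and a value near 0 mostly texture-based decisions. Computing the same score for human observers and for a CNN on identical images is one way to compare their classification strategies.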