Many visual phenomena suggest that humans use top-down generative or reconstructive processes to create visual percepts (e.g., imagery, object completion, pareidolia), but little is known about the role reconstruction plays in robust object recognition. We built an iterative encoder-decoder network that generates an object reconstruction and uses it as top-down attentional feedback to route the most relevant spatial and feature information to feedforward object recognition processes. We tested this model on the challenging out-of-distribution digit recognition dataset MNIST-C, in which 15 different types of transformations and corruptions are applied to handwritten digit images. Our model showed strong generalization performance against various image perturbations, on average outperforming all other models, including feedforward CNNs and adversarially trained networks. Our model is particularly robust to blur, noise, and occlusion corruptions, where shape perception plays an important role. Ablation studies further reveal two complementary roles of spatial and feature-based attention in robust object recognition: the former is largely consistent with spatial masking benefits in the attention literature (the reconstruction serves as a mask), while the latter mainly contributes to the model's inference speed (i.e., the number of time steps needed to reach a given confidence threshold) by reducing the space of possible object hypotheses. We also observed that the model sometimes hallucinates a nonexistent pattern out of noise, leading to highly interpretable human-like errors. Our study shows that modeling reconstruction-based feedback endows AI systems with a powerful attention mechanism, which can help us understand the role of generative perception in human visual processing.
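To make the described mechanism concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of an iterative encoder-decoder loop in which the decoder's reconstruction acts as a spatial attention mask on the input and the current class belief re-weights encoder features, iterating until a confidence threshold is reached. The class name, layer sizes, gating functions, and the threshold and step count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IterativeReconstructionNet(nn.Module):
    """Hypothetical sketch of reconstruction-guided attention for MNIST-sized inputs."""

    def __init__(self, n_classes: int = 10, hidden: int = 128):
        super().__init__()
        self.hidden = hidden
        # Feedforward encoder: image -> feature vector.
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(28 * 28, hidden), nn.ReLU()
        )
        self.classifier = nn.Linear(hidden, n_classes)
        # Generative decoder: class belief + features -> object reconstruction.
        self.decoder = nn.Sequential(
            nn.Linear(n_classes + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 28 * 28), nn.Sigmoid()
        )
        # Learned projection used here to model feature-based attention (assumption).
        self.feature_gain = nn.Linear(n_classes, hidden)

    def forward(self, x, max_steps: int = 5, conf_threshold: float = 0.9):
        # Start with uniform spatial attention and uniform feature gain.
        spatial_mask = torch.ones_like(x)
        gain = torch.ones(x.size(0), self.hidden, device=x.device)
        for _ in range(max_steps):
            # Feedforward pass on the attention-modulated input.
            feats = self.encoder(x * spatial_mask) * gain
            logits = self.classifier(feats)
            probs = F.softmax(logits, dim=-1)
            # Top-down reconstruction of the currently hypothesized object.
            recon = self.decoder(torch.cat([probs, feats], dim=-1)).view_as(x)
            # Spatial attention: the reconstruction serves as a soft mask on the image.
            spatial_mask = recon
            # Feature-based attention: the class belief re-weights encoder features,
            # shrinking the space of object hypotheses on the next iteration.
            gain = torch.sigmoid(self.feature_gain(probs)) * 2.0
            # Stop once every item in the batch exceeds the confidence threshold.
            if probs.max(dim=-1).values.min() > conf_threshold:
                break
        return logits, recon


# Usage example with random stand-ins for MNIST-C digits.
model = IterativeReconstructionNet()
images = torch.rand(4, 1, 28, 28)
logits, reconstruction = model(images)
```

Under these assumptions, the early-exit check is what ties feature-based attention to inference speed: a tighter hypothesis space raises confidence sooner, so fewer iterations are needed.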