Compared to human vision, computer vision based on convolutional neural networks (CNNs) is more vulnerable to adversarial noise. This difference is likely attributable to how the eyes sample visual input and how the brain processes retinal samples through its dorsal and ventral visual pathways, mechanisms that remain under-explored in computer vision. Inspired by the brain, we design recurrent neural networks that include an input sampler mimicking the human retina, a dorsal network that guides where to look next, and a ventral network that represents the retinal samples. Taken together, these modules let a model learn to take multiple glances at an image, attend to a salient part at each glance, and accumulate its representation over time to recognize the image. We test such models for their robustness against varying levels of adversarial noise, with a special focus on the effect of different input sampling strategies. Our findings suggest that retinal foveation and sampling render a model more robust to adversarial noise, and that a model may recover from an attack when given more time to take additional glances at an image. In conclusion, robust visual recognition can benefit from the combined use of three brain-inspired mechanisms: retinal transformation, attention-guided eye movement, and recurrent processing, as opposed to feedforward-only CNNs.
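To make the glance-attend-accumulate loop concrete, the sketch below is a deliberately minimal, hypothetical NumPy version of the three mechanisms named above: a foveated crop stands in for the retinal sampler, a saliency `argmax` stands in for the dorsal "where to look next" network, and a `tanh` recurrence stands in for the ventral representation. None of the function names or dimensions come from the paper; they are illustrative assumptions only.

```python
import numpy as np

def foveate(image, cx, cy, size=8):
    """Retinal-sampler stand-in: crop a high-resolution patch ("fovea")
    around the fixation point, clamped to the image border. A real
    retinal transform would also add a blurred low-resolution periphery;
    this sketch keeps only the fovea."""
    h, w = image.shape
    x0 = min(max(cx - size // 2, 0), w - size)
    y0 = min(max(cy - size // 2, 0), h - size)
    return image[y0:y0 + size, x0:x0 + size]

def next_fixation(saliency):
    """Dorsal-network stand-in: fixate where saliency is highest."""
    cy, cx = np.unravel_index(np.argmax(saliency), saliency.shape)
    return cx, cy

def recognize(image, n_glances=3):
    """Take several glances and accumulate a recurrent representation,
    which a classifier head would then read out (omitted here)."""
    state = np.zeros(64)                 # ventral representation (8*8 patch)
    saliency = np.abs(image)             # toy saliency map, not learned
    cx, cy = next_fixation(saliency)
    for _ in range(n_glances):
        patch = foveate(image, cx, cy).ravel()
        state = np.tanh(state + patch)   # toy recurrent accumulation
        # inhibition of return: suppress the attended region, then move on
        saliency[max(cy - 4, 0):cy + 4, max(cx - 4, 0):cx + 4] = -np.inf
        cx, cy = next_fixation(saliency)
    return state

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
rep = recognize(img)
print(rep.shape)  # (64,)
```

The recurrence is what allows self-correction over time: each additional glance folds a new, differently located retinal sample into the same state, so noise that dominates one fixation need not dominate the accumulated representation.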