Most feedforward convolutional neural networks spend roughly the same amount of computation on each pixel. Yet human visual recognition is an interaction between eye movements and spatial attention, in which we take several glimpses of an object at different regions. Inspired by this observation, we propose an end-to-end trainable Multi-Glimpse Network (MGNet), which aims to tackle the challenges of high computational cost and lack of robustness via a recurrent downsampled attention mechanism. Specifically, MGNet sequentially selects task-relevant regions of an image to focus on and then adaptively combines all collected information for the final prediction. MGNet exhibits strong resistance to adversarial attacks and common corruptions while requiring less computation. MGNet is also inherently more interpretable, as it explicitly indicates where it focuses at each iteration. Our experiments on ImageNet100 demonstrate the potential of recurrent downsampled attention mechanisms to improve over a single feedforward pass. For example, MGNet improves accuracy by 4.76% on average under common corruptions at only 36.9% of the computational cost. Moreover, under the same PGD attack strength with a ResNet-50 backbone, the baseline's accuracy drops to 7.6% while MGNet maintains 44.2%. Our code is available at https://github.com/siahuat0727/MGNet.
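The glimpse loop described above — sequentially cropping task-relevant regions, downsampling each glimpse, and combining the collected information — can be sketched as follows. This is a minimal illustrative toy in NumPy, not the authors' implementation: the glimpse centers are given externally (in MGNet they are predicted by the network), the "features" are just downsampled pixel patches, and the combination is a plain average rather than a learned adaptive one.

```python
import numpy as np

def downsample(patch, out_size):
    """Block-average a 2-D patch down to out_size x out_size.
    Stands in for processing a small glimpse at reduced resolution."""
    h, w = patch.shape
    fh, fw = h // out_size, w // out_size
    cropped = patch[:out_size * fh, :out_size * fw]
    return cropped.reshape(out_size, fh, out_size, fw).mean(axis=(1, 3))

def multi_glimpse(image, centers, glimpse_size=8, out_size=4):
    """Toy multi-glimpse pass: crop a glimpse around each center,
    downsample it, and combine all glimpses by simple averaging
    (a placeholder for MGNet's learned adaptive combination)."""
    half = glimpse_size // 2
    feats = []
    for cy, cx in centers:
        y0, x0 = max(cy - half, 0), max(cx - half, 0)
        patch = image[y0:y0 + glimpse_size, x0:x0 + glimpse_size]
        feats.append(downsample(patch, out_size).ravel())
    return np.mean(feats, axis=0)
```

For example, three glimpses over a 32x32 image each yield a 4x4 downsampled feature (16 values), and the averaged vector would feed the final prediction. The computational saving comes from processing several small low-resolution glimpses instead of the full-resolution image in one pass.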