The recent success of Vision Transformers is challenging the decade-long dominance of Convolutional Neural Networks (CNNs) in image recognition. Specifically, in terms of robustness on out-of-distribution samples, recent research finds that Transformers are inherently more robust than CNNs, regardless of the training setup. Moreover, it is believed that such superiority of Transformers should largely be credited to their self-attention-like architectures per se. In this paper, we question that belief by closely examining the design of Transformers. Our findings lead to three highly effective architecture designs for boosting robustness, each simple enough to be implemented in several lines of code, namely a) patchifying input images, b) enlarging the kernel size, and c) reducing activation and normalization layers. Bringing these components together, we are able to build pure CNN architectures, without any attention-like operations, that are as robust as, or even more robust than, Transformers. We hope this work can help the community better understand the design of robust neural architectures. The code is publicly available at https://github.com/UCSC-VLAA/RobustCNN.
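To make the three design changes concrete, the following is a minimal PyTorch sketch of what they could look like in a ConvNeXt-style convolutional block. It is an illustrative assumption, not the authors' released implementation (see the linked repository for that); names such as RobustBlock, patchify_stem, and the specific channel and kernel sizes are hypothetical.

```python
# Minimal sketch of: a) a patchify stem, b) an enlarged depthwise kernel,
# c) a block with only one activation and one normalization layer.
# Hypothetical names and hyperparameters; not the authors' exact code.
import torch
import torch.nn as nn

# a) Patchify the input: a non-overlapping strided conv replaces the usual
#    dense stem, mirroring ViT-style patch embedding.
patchify_stem = nn.Conv2d(3, 96, kernel_size=8, stride=8)

class RobustBlock(nn.Module):
    """Residual block with b) a large depthwise kernel and
    c) a single activation and a single normalization layer."""
    def __init__(self, dim: int, kernel_size: int = 11):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)       # the only norm layer
        self.pwconv1 = nn.Conv2d(dim, 4 * dim, 1)
        self.act = nn.GELU()                  # the only activation
        self.pwconv2 = nn.Conv2d(4 * dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        return shortcut + x

x = torch.randn(1, 3, 224, 224)
feats = patchify_stem(x)       # (1, 96, 28, 28)
out = RobustBlock(96)(feats)   # same shape as feats
```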