Vision transformers have demonstrated state-of-the-art results on a variety of computer vision tasks using attention-based networks. However, most transformer studies do not investigate the robustness/accuracy trade-off, and transformers still struggle to handle adversarial perturbations. In this paper, we explore the robustness of vision transformers against adversarial perturbations and try to enhance their robustness/accuracy trade-off in white-box attack settings. To this end, we propose the Locality iN Locality (LNL) transformer model. We demonstrate that the locality introduced in LNL contributes to robustness, since it aggregates local information such as lines, edges, shapes, and even objects. In addition, to further improve robustness, we encourage LNL to extract training signal from the moments (i.e., the mean and standard deviation) and the normalized features. We validate the effectiveness and generality of LNL by achieving state-of-the-art results in terms of accuracy and robustness metrics on the German Traffic Sign Recognition Benchmark (GTSRB) and the Canadian Institute for Advanced Research (CIFAR-10) datasets. More specifically, for traffic sign classification, the proposed LNL yields gains of 1.1% and ~35% in clean and robust accuracy, respectively, compared to state-of-the-art studies.
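As a rough illustration of the idea of drawing training signal from feature moments and normalized features, the following is a minimal sketch assuming PyTorch-style transformer token features; the module name `MomentHead`, the pooling choices, and the way the moments are fed to the classifier are hypothetical and are not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class MomentHead(nn.Module):
    """Hypothetical head that classifies from normalized tokens plus their moments."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        # Input: pooled normalized features, per-channel mean, per-channel std.
        self.fc = nn.Linear(3 * dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) transformer features
        mu = tokens.mean(dim=1)                                   # per-channel mean
        sigma = tokens.std(dim=1)                                 # per-channel std
        normalized = (tokens - mu.unsqueeze(1)) / (sigma.unsqueeze(1) + 1e-6)
        pooled = normalized.mean(dim=1)                           # pool normalized tokens
        # Expose both the normalized features and the moments to the classifier,
        # so gradients (training signal) flow through mean/std explicitly.
        return self.fc(torch.cat([pooled, mu, sigma], dim=-1))

# Example usage with GTSRB's 43 classes (hypothetical dimensions):
# logits = MomentHead(dim=384, num_classes=43)(token_features)
```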