We investigate the robustness of vision transformers (ViTs) through the lens of their distinctive patch-based architecture, i.e., they process an image as a sequence of image patches. We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics and makes the image unrecognizable to humans. This indicates that ViTs rely heavily on features that survive such transformations but are generally not indicative of the semantic class to humans. Further investigation shows that these features are useful but non-robust: ViTs trained on them can achieve high in-distribution accuracy but break down under distribution shifts. From this understanding, we ask: can training the model to rely less on these features improve ViT robustness and out-of-distribution performance? We use images transformed with our patch-based operations as negatively augmented views and propose losses that regularize training away from these non-robust features. This offers a complementary view to existing research, which mostly focuses on augmenting inputs with semantics-preserving transformations to enforce models' invariance. We show that patch-based negative augmentation consistently improves the robustness of ViTs across a wide set of ImageNet-based robustness benchmarks. Furthermore, we find that patch-based negative augmentation is complementary to traditional (positive) data augmentation, and the two together boost performance further.
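To make the idea concrete, below is a minimal sketch of one such patch-based negative augmentation, patch shuffling, combined with a regularizer that discourages confident class predictions on the shuffled (semantics-destroying) view by pulling them toward the uniform distribution. All names (`patch_shuffle`, `training_loss`, `lambda_neg`) and the specific uniform-target formulation are illustrative assumptions, not necessarily the paper's exact losses:

```python
# Hypothetical sketch of patch-based negative augmentation (PyTorch).
# Assumption: the regularizer penalizes confident predictions on the
# negative view via KL(uniform || p_model); the paper's losses may differ.
import torch
import torch.nn.functional as F


def patch_shuffle(images: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Randomly permute the non-overlapping patches of each image.

    images: (B, C, H, W), with H and W divisible by patch_size.
    """
    b, c, h, w = images.shape
    gh, gw = h // patch_size, w // patch_size
    # Split into a grid of patches: (B, gh*gw, C, patch, patch).
    patches = images.reshape(b, c, gh, patch_size, gw, patch_size)
    patches = patches.permute(0, 2, 4, 1, 3, 5).reshape(
        b, gh * gw, c, patch_size, patch_size
    )
    # Draw an independent random permutation of patches per image.
    perm = torch.argsort(torch.rand(b, gh * gw, device=images.device), dim=1)
    patches = patches[torch.arange(b, device=images.device).unsqueeze(1), perm]
    # Reassemble the shuffled patches into full images.
    patches = patches.reshape(b, gh, gw, c, patch_size, patch_size)
    return patches.permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)


def training_loss(model, images, labels, patch_size=16, lambda_neg=0.3):
    """Cross-entropy on clean images plus a negative-augmentation term.

    The negative term is KL(uniform || p_model) on the shuffled view,
    up to an additive constant, so minimizing it pushes predictions on
    semantics-destroyed inputs toward uniform.
    """
    ce = F.cross_entropy(model(images), labels)
    neg_logits = model(patch_shuffle(images, patch_size))
    log_probs = F.log_softmax(neg_logits, dim=-1)
    # Cross-entropy against the uniform target distribution.
    neg_loss = -log_probs.mean(dim=-1).mean()
    return ce + lambda_neg * neg_loss
```

The design choice sketched here treats the transformed image purely as a negative view: unlike positive augmentation, the model is never asked to predict the original label on it, only to avoid relying on the label-predictive but non-robust features that survive the shuffle.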