Intriguing Properties of Vision Transformers

Vision transformers (ViTs) have demonstrated impressive performance across various machine vision problems. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how this flexibility in attending to image-wide context, conditioned on a given patch, helps in handling nuisances in natural images, e.g., severe occlusions, domain shifts, spatial permutations, and adversarial and natural perturbations. We systematically study this question via an extensive set of experiments encompassing three ViT families and comparisons with a high-performing convolutional neural network (CNN). We show and analyze the following intriguing properties of ViTs: (a) Transformers are highly robust to severe occlusions, perturbations, and domain shifts, e.g., they retain as much as 60% top-1 accuracy on ImageNet even after 80% of the image content is randomly occluded. (b) This robustness to occlusions is not due to a bias towards local textures; ViTs are significantly less biased towards textures than CNNs. When properly trained to encode shape-based features, ViTs demonstrate shape recognition capability comparable to that of the human visual system, previously unmatched in the literature. (c) Using ViTs to encode shape representations leads to an interesting consequence: accurate semantic segmentation without pixel-level supervision. (d) Off-the-shelf features from a single ViT model can be combined to create a feature ensemble, leading to high accuracy across a range of classification datasets in both traditional and few-shot learning paradigms. We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
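
To make the occlusion setting in (a) concrete, the following is a minimal sketch of randomly occluding a fixed fraction of image patches. It is an illustration under stated assumptions, not the procedure used in the experiments: the function name `random_patch_drop`, the 16-pixel patch size, the 224×224 input, and the choice to zero out occluded patches are all hypothetical.

```python
import numpy as np


def random_patch_drop(image, patch_size=16, drop_ratio=0.8, seed=None):
    """Zero out a random subset of non-overlapping square patches.

    Assumptions (not taken from the abstract): patches are square and
    non-overlapping, and an occluded patch is replaced with zeros.
    Returns the occluded image and a boolean pixel mask (True = kept).
    """
    h, w = image.shape[:2]
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")

    rows, cols = h // patch_size, w // patch_size
    num_patches = rows * cols
    num_drop = int(round(drop_ratio * num_patches))

    # Pick which patches to occlude, without replacement.
    rng = np.random.default_rng(seed)
    drop_idx = rng.choice(num_patches, size=num_drop, replace=False)

    keep = np.ones(num_patches, dtype=bool)
    keep[drop_idx] = False

    # Expand the patch-grid mask to pixel resolution.
    mask = keep.reshape(rows, cols)
    mask = mask.repeat(patch_size, axis=0).repeat(patch_size, axis=1)

    occluded = image.copy()
    occluded[~mask] = 0  # works for both (H, W) and (H, W, C) arrays
    return occluded, mask


if __name__ == "__main__":
    img = np.ones((224, 224, 3), dtype=np.float32)  # placeholder image
    occluded, mask = random_patch_drop(img, drop_ratio=0.8, seed=0)
    print("fraction of pixels kept:", mask.mean())
```

Feeding `occluded` to a classifier and comparing top-1 accuracy across several values of `drop_ratio` would give a robustness curve of the kind the abstract summarizes.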
