Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision, yet we understand very little about why they work and what they learn. While existing studies visually analyze the mechanisms of convolutional neural networks, an analogous exploration of ViTs remains challenging. In this paper, we first address the obstacles to performing visualizations on ViTs. Assisted by these solutions, we observe that neurons in ViTs trained with language model supervision (e.g., CLIP) are activated by semantic concepts rather than visual features. We also explore the underlying differences between ViTs and CNNs, and we find that transformers detect image background features, just like their convolutional counterparts, but their predictions depend far less on high-frequency information. On the other hand, both architecture types behave similarly in the way features progress from abstract patterns in early layers to concrete objects in late layers. In addition, we show that ViTs maintain spatial information in all layers except the final layer. In contrast to previous works, we show that the last layer most likely discards the spatial information and behaves as a learned global pooling operation. Finally, we conduct large-scale visualizations on a wide range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin, to validate the effectiveness of our method.
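To make the high-frequency claim above concrete, the following is a minimal sketch (not the paper's actual protocol) of one way to probe it: low-pass filter inputs in the Fourier domain and measure how often a model keeps its top-1 prediction. The `timm` model names and the `prediction_agreement` helper are illustrative assumptions, not artifacts from the paper.

```python
# Hypothetical sketch: compare how strongly a ViT vs. a CNN relies on
# high-frequency image content by low-pass filtering inputs and checking
# whether the top-1 prediction survives. Not the paper's exact procedure.
import torch
import torch.fft


def low_pass_filter(images: torch.Tensor, radius: int) -> torch.Tensor:
    """Keep only frequencies inside a centered square of half-width `radius`."""
    _, _, h, w = images.shape
    freq = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
    mask = torch.zeros(h, w, device=images.device)
    cy, cx = h // 2, w // 2
    mask[cy - radius:cy + radius, cx - radius:cx + radius] = 1.0
    freq = freq * mask  # zero out high-frequency components
    return torch.fft.ifft2(torch.fft.ifftshift(freq, dim=(-2, -1))).real


@torch.no_grad()
def prediction_agreement(model, images: torch.Tensor, radius: int) -> float:
    """Fraction of images whose top-1 prediction is unchanged after filtering."""
    model.eval()
    clean_pred = model(images).argmax(dim=-1)
    filtered_pred = model(low_pass_filter(images, radius)).argmax(dim=-1)
    return (clean_pred == filtered_pred).float().mean().item()


# Example usage (assumes `timm` is installed and `batch` is a normalized
# image tensor of shape [B, 3, 224, 224]):
# import timm
# vit = timm.create_model("vit_base_patch16_224", pretrained=True)
# cnn = timm.create_model("resnet50", pretrained=True)
# print(prediction_agreement(vit, batch, radius=28))
# print(prediction_agreement(cnn, batch, radius=28))
```

Under the paper's finding, one would expect the ViT's agreement score to degrade less than the CNN's as `radius` shrinks, since its predictions depend far less on high-frequency information.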