Convolutional architectures have proven extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision Transformers (ViTs) rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pre-trained convolutional networks. In this paper, we ask the following question: is it possible to combine the strengths of these two architectures while avoiding their respective limitations? To this end, we introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias. We initialize the GPSA layers to mimic the locality of convolutional layers, then give each attention head the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information. The resulting convolutional-like ViT architecture, ConViT, outperforms the DeiT on ImageNet, while offering a much improved sample efficiency. We further investigate the role of locality in learning by first quantifying how it is encouraged in vanilla self-attention layers, then analyzing how it is escaped in GPSA layers. We conclude by presenting various ablations to better understand the success of the ConViT. Our code and models are released publicly.
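To make the gating mechanism concrete, below is a minimal sketch of a gated positional self-attention layer, based only on the description in the abstract: each head forms a convex combination of content-based and positional attention, weighted by a sigmoid of a learnable per-head gating parameter. The class and parameter names are illustrative, the dense per-head positional score table stands in for the paper's relative-position parameterization, and the convolution-mimicking initialization of the positional scores is omitted for brevity; this is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPSA(nn.Module):
    """Sketch of gated positional self-attention (one layer).

    Each head mixes content attention with positional attention via a
    learnable gate; in the ConViT, the positional scores are initialized
    so each head attends locally, mimicking a convolution.
    """
    def __init__(self, dim, num_heads=4, num_patches=196):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qk = nn.Linear(dim, 2 * dim, bias=False)   # content queries/keys
        self.v = nn.Linear(dim, dim, bias=False)        # values
        self.proj = nn.Linear(dim, dim)
        # One gating parameter per head: sigmoid(gate) weights position vs. content.
        self.gate = nn.Parameter(torch.zeros(num_heads))
        # Learned positional attention scores per head over patch pairs.
        # (A dense table keeps the sketch short; the paper parameterizes
        # these through relative position embeddings instead.)
        self.pos_scores = nn.Parameter(torch.zeros(num_heads, num_patches, num_patches))

    def forward(self, x):                      # x: (batch, patches, dim)
        B, N, D = x.shape
        q, k = self.qk(x).chunk(2, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        content = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        position = F.softmax(self.pos_scores[:, :N, :N], dim=-1)
        g = torch.sigmoid(self.gate).view(1, -1, 1, 1)  # (1, heads, 1, 1)
        attn = (1.0 - g) * content + g * position       # convex combination per head
        v = self.v(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```

With the gates initialized to zero, sigmoid gives each head an even 0.5/0.5 split between position and content; during training each head is free to push its gate toward zero and "escape" the positional (local) prior, which is the behavior the paper analyzes.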