Vision Transformers (ViTs) have recently become the state of the art across many computer vision tasks. In contrast to convolutional neural networks (CNNs), ViTs enable global information sharing even within the shallow layers of a network, i.e., among high-resolution features. However, this advantage was later set aside with the success of pyramid architectures such as the Swin Transformer, which show better performance-complexity trade-offs. In this paper, we present a simple and efficient add-on component (termed GrafT) that considers global dependencies and multi-scale information throughout the network, in both high- and low-resolution features alike. GrafT can be easily adopted in both homogeneous and pyramid Transformers while showing consistent gains. It has the flexibility of branching out at arbitrary depths, widening a network with multiple scales. This grafting operation lets the branches share most of the parameters and computations of the backbone, adding only minimal complexity but with a higher yield. In fact, by progressively compounding multi-scale receptive fields, GrafT enables communication between distant local regions. We show the benefits of the proposed method on multiple benchmarks, including image classification (ImageNet-1K), semantic segmentation (ADE20K), and object detection and instance segmentation (COCO2017). Our code and models will be made available.
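To make the grafting idea concrete, the following is a minimal PyTorch sketch, not the paper's actual implementation: a hypothetical `GraftBranch` module that branches off a backbone stage, pools the features to a coarser scale, applies global self-attention there (cheap at low resolution), and fuses the result back into the high-resolution stream. All module names, the pooling stride, and the residual fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GraftBranch(nn.Module):
    """Illustrative sketch of a multi-scale branch grafted onto a backbone
    stage (assumed design, not the paper's implementation): pool features to
    a coarse scale, run global attention there, then fuse back."""

    def __init__(self, dim, pool_stride=4, num_heads=4):
        super().__init__()
        self.stride = pool_stride
        self.pool = nn.AvgPool2d(pool_stride)          # branch out at a coarser scale
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.up = nn.Upsample(scale_factor=pool_stride, mode="nearest")

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        y = self.pool(x)                               # (B, C, H/s, W/s)
        tokens = self.norm(y.flatten(2).transpose(1, 2))   # (B, N, C) coarse tokens
        tokens, _ = self.attn(tokens, tokens, tokens)  # global attention, cheap at coarse res
        y = tokens.transpose(1, 2).reshape(b, c, h // self.stride, w // self.stride)
        return x + self.up(y)                          # fuse back into high-res stream

feat = torch.randn(2, 64, 32, 32)
out = GraftBranch(dim=64)(feat)
print(out.shape)  # same shape as the input: the branch is a residual add-on
```

Because the branch only pools, attends, and adds back, it reuses the backbone's features rather than duplicating them, which is the sense in which most parameters and computation are shared.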