Vision Transformers (ViTs) have recently become the state of the art across many computer vision tasks. In contrast to convolutional neural networks (CNNs), ViTs enable global information sharing even within the shallow layers of a network, i.e., among high-resolution features. However, this advantage was later set aside with the success of pyramid architectures such as the Swin Transformer, which show better performance-complexity trade-offs. In this paper, we present a simple and efficient add-on component (termed GrafT) that considers global dependencies and multi-scale information throughout the network, in both high- and low-resolution features alike. GrafT can be easily adopted in both homogeneous and pyramid Transformers while showing consistent gains. It has the flexibility of branching out at arbitrary depths, widening a network with multiple scales. This grafting operation lets the branches share most of the parameters and computations of the backbone, adding only minimal complexity but with a higher yield. In fact, by progressively compounding multi-scale receptive fields, GrafT enables communication between distant local regions. We show the benefits of the proposed method on multiple benchmarks, including image classification (ImageNet-1K), semantic segmentation (ADE20K), and object detection and instance segmentation (COCO2017). Our code and models will be made available.
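To make the grafting idea concrete, the following is a minimal PyTorch sketch, not the paper's actual implementation: a hypothetical `GraftBranch` module that branches off a backbone stage, pools the features to a coarser scale, applies global self-attention there (cheap at low resolution), and fuses the result back into the high-resolution stream. All module names, the pooling stride, and the residual fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GraftBranch(nn.Module):
    """Illustrative sketch of a multi-scale branch grafted onto a backbone
    stage (assumed design, not the paper's implementation): pool features to
    a coarse scale, run global attention there, then fuse back."""

    def __init__(self, dim, pool_stride=4, num_heads=4):
        super().__init__()
        self.stride = pool_stride
        self.pool = nn.AvgPool2d(pool_stride)          # branch out at a coarser scale
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.up = nn.Upsample(scale_factor=pool_stride, mode="nearest")

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        y = self.pool(x)                               # (B, C, H/s, W/s)
        tokens = self.norm(y.flatten(2).transpose(1, 2))   # (B, N, C) coarse tokens
        tokens, _ = self.attn(tokens, tokens, tokens)  # global attention, cheap at coarse res
        y = tokens.transpose(1, 2).reshape(b, c, h // self.stride, w // self.stride)
        return x + self.up(y)                          # fuse back into high-res stream

feat = torch.randn(2, 64, 32, 32)
out = GraftBranch(dim=64)(feat)
print(out.shape)  # same shape as the input: the branch is a residual add-on
```

Because the branch only pools, attends, and adds back, it reuses the backbone's features rather than duplicating them, which is the sense in which most parameters and computation are shared.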