We propose the Global Context Vision Transformer (GC ViT), a novel architecture that enhances parameter and compute utilization. Our method leverages global context self-attention modules, in conjunction with local self-attention, to effectively yet efficiently model both long- and short-range spatial interactions, without the need for expensive operations such as computing attention masks or shifting local windows. In addition, we address the lack of inductive bias in ViTs by proposing to use modified fused inverted-residual blocks in our architecture. Our proposed GC ViT achieves state-of-the-art results across image classification, object detection, and semantic segmentation tasks. On the ImageNet-1K classification dataset, the tiny, small, and base variants of GC ViT with 28M, 51M, and 90M parameters achieve 83.3%, 83.9%, and 84.5% Top-1 accuracy, respectively, surpassing comparably sized prior art such as the CNN-based ConvNeXt and the ViT-based Swin Transformer by a large margin. Pre-trained GC ViT backbones consistently outperform prior work on the downstream tasks of object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, sometimes by large margins. Code is available at https://github.com/NVlabs/GCViT.
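To make the abstract's central idea concrete, the sketch below illustrates window attention in which the queries can optionally be replaced by shared global query tokens, alongside standard local (per-window) attention. This is a minimal PyTorch sketch under our own assumptions; module names, shapes, and the `WindowAttention`/`window_partition` helpers are illustrative, not the reference GC ViT implementation.

```python
# Minimal sketch of local window attention with an optional shared global query,
# illustrating the local/global self-attention idea described above.
# All names and shapes here are illustrative assumptions.
import torch
import torch.nn as nn


def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into non-overlapping ws x ws windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)


class WindowAttention(nn.Module):
    """Multi-head attention within a local window.

    If `global_q` is provided, it replaces the locally computed queries,
    so every window attends with the same shared (global) queries.
    """

    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, global_q=None):
        B_, N, C = x.shape
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B_, heads, N, head_dim)
        if global_q is not None:
            # broadcast the shared global queries to every window
            q = global_q.expand(B_, -1, -1, -1)
        attn = (q * self.scale) @ k.transpose(-2, -1)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B_, N, C)
        return self.proj(out)


# Usage sketch: local pass uses per-window queries; global pass supplies
# hypothetical shared global query tokens of shape (1, heads, tokens, head_dim).
feat = torch.randn(2, 14, 14, 96)            # (B, H, W, C) feature map
x = window_partition(feat, ws=7)             # -> (2 * 4 windows, 49 tokens, 96)
attn = WindowAttention(dim=96, num_heads=3)
gq = torch.randn(1, 3, 49, 32)               # assumed global queries
out_local = attn(x)                          # local self-attention
out_global = attn(x, global_q=gq)            # global-query self-attention
```

Note that, as the abstract states, this scheme needs neither attention masks nor window shifting: cross-window interaction comes from sharing the global queries rather than from moving the windows.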