We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization. Our method leverages global context self-attention modules, jointly with local self-attention, to effectively yet efficiently model both long- and short-range spatial interactions, without the need for expensive operations such as computing attention masks or shifting local windows. In addition, we address the lack of inductive bias in ViTs by proposing to use modified fused inverted residual blocks in our architecture. Our proposed GC ViT achieves state-of-the-art results across image classification, object detection, and semantic segmentation tasks. On the ImageNet-1K dataset for classification, the tiny, small, and base variants of GC ViT with $28$M, $51$M, and $90$M parameters achieve $\textbf{83.2\%}$, $\textbf{83.9\%}$, and $\textbf{84.4\%}$ Top-1 accuracy, respectively, surpassing comparably-sized prior art such as the CNN-based ConvNeXt and the ViT-based Swin Transformer by a large margin. Pre-trained GC ViT backbones consistently outperform prior work, sometimes by large margins, on the downstream tasks of object detection, instance segmentation, and semantic segmentation using the MS COCO and ADE20K datasets. Code is available at https://github.com/NVlabs/GCViT.
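To make the abstract's central idea concrete, the following is a minimal sketch (not the official NVlabs/GCViT implementation) of window attention in which the query tokens come from a shared, image-level representation rather than from each local window, so every window can access global context without shifted windows or attention masks. All module and parameter names here (e.g. `GlobalQueryAttention`, `q_global`) are illustrative assumptions; how the global query tokens are produced is left outside the sketch.

```python
# Minimal sketch of "global query" window attention, assuming global query
# tokens q_global are already available (e.g. from a pooled feature map).
import torch
import torch.nn as nn


def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)


class GlobalQueryAttention(nn.Module):
    """Window attention whose queries are shared global tokens.

    Keys/values are computed per local window; the same global query tokens
    attend to every window of an image, injecting long-range context while
    the attention cost stays local.
    """

    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.window_size = window_size
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, q_global):
        # x:        (B, H, W, C) local features
        # q_global: (B, window_size*window_size, C) global query tokens
        B, H, W, C = x.shape
        windows = window_partition(x, self.window_size)           # (B*nW, N, C)
        Bw, N, _ = windows.shape
        kv = self.kv(windows).reshape(Bw, N, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)                          # each (Bw, heads, N, hd)

        # Broadcast the global queries to every window of the matching image.
        q = q_global.reshape(B, -1, self.num_heads, self.head_dim)
        q = q.permute(0, 2, 1, 3).repeat_interleave(Bw // B, dim=0)

        attn = (q * self.scale) @ k.transpose(-2, -1)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(Bw, N, C)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 14, 14, 96)                     # toy feature map
    q = torch.randn(2, 7 * 7, 96)                      # toy global query tokens
    print(GlobalQueryAttention(96, 3, 7)(x, q).shape)  # -> (8, 49, 96)
```

In a full model, blocks like this would alternate with ordinary local window self-attention, as the abstract describes; the sketch only illustrates the global-query variant under the stated assumptions.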