Vision Transformers (ViTs) have recently become the state of the art across many computer vision tasks. In contrast to convolutional networks (CNNs), ViTs enable global information sharing even within shallow layers of a network, i.e., among high-resolution features. However, this advantage was later overlooked with the success of pyramid architectures such as the Swin Transformer, which offer better performance-complexity trade-offs. In this paper, we present a simple and efficient add-on component (termed GrafT) that considers global dependencies and multi-scale information throughout the network, in both high- and low-resolution features alike. It has the flexibility to branch out at arbitrary depths and shares most of the parameters and computations of the backbone. GrafT shows consistent gains over various well-known models, which include both hybrid and pure Transformer types, both homogeneous and pyramid structures, and various self-attention methods. In particular, it largely benefits mobile-size models by providing high-level semantics. On the ImageNet-1k dataset, GrafT delivers +3.9%, +1.4%, and +1.9% top-1 accuracy improvements over DeiT-T, Swin-T, and MobileViT-XXS, respectively. Our code and models will be made available.
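The abstract describes GrafT only at a high level, so the snippet below is a minimal, purely illustrative sketch of the stated idea: a branch that adds global attention over intermediate features while reusing the backbone's tokens. The `GraftBranch` name, the token pooling, and the residual attachment are all assumptions made for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn


class GraftBranch(nn.Module):
    """Toy global-attention branch attached to intermediate backbone tokens.

    Hypothetical illustration only; the class name, pooling step, and
    attachment strategy are assumptions, not the paper's implementation.
    """

    def __init__(self, dim: int, num_heads: int = 4, pool_size: int = 2):
        super().__init__()
        # Coarsen tokens so global attention stays cheap on high-resolution features.
        self.pool = nn.AvgPool1d(pool_size)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) token sequence taken from a backbone block at some depth.
        coarse = self.pool(x.transpose(1, 2)).transpose(1, 2)  # (B, N/pool, C)
        # Full-resolution queries attend to pooled keys/values for global context.
        out, _ = self.attn(self.norm(x), coarse, coarse)
        return x + out  # residual: backbone features are reused, not replaced


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 192)          # e.g., a DeiT-T-like token map
    branch = GraftBranch(dim=192, num_heads=4)
    print(branch(tokens).shape)                # torch.Size([2, 196, 192])
```

In this sketch, the branch could be applied to the output of any backbone block, which loosely mirrors the abstract's claim of branching out at arbitrary depths while sharing the backbone's computation.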