Vision Transformers have achieved state-of-the-art performance on many visual tasks. Due to the quadratic computational and memory complexities of self-attention, recent works either apply attention only to low-resolution inputs or restrict the receptive field to a small local region. To overcome these limitations, we propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency-gate to obtain attention weights, modeling local-global interactions in all stages. Key-only attention has linear computational and memory complexities w.r.t. input size. We use an alternate layout to hybridize convolution and attention layers, instead of the grafting suggested by previous works, so that all stages benefit from both spatial attention and convolution. We leverage these improvements to develop a new self-attention model family, LinGlos, which reaches state-of-the-art accuracy in the parameter-limited setting of the ImageNet classification benchmark and significantly outperforms baselines on downstream tasks, e.g., COCO object detection and ADE20K semantic segmentation.
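The abstract does not spell out the exact formulation, but the sketch below illustrates the general shape of a key-only attention layer with linear cost: attention weights come from a saliency score computed on the keys alone (no query-key dot products), and the resulting global context is mixed back into every position. The `KeyOnlyAttention` module, the linear `saliency` gate, and the value-plus-context mixing are illustrative assumptions, not the authors' implementation.

```python
# Minimal PyTorch sketch of key-only attention (assumed form, not the paper's code).
import torch
import torch.nn as nn


class KeyOnlyAttention(nn.Module):
    """Attention without query-key pairwise interaction: each token's weight
    comes from a saliency score on the keys alone, so cost is linear in the
    number of tokens N (hypothetical reconstruction)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_kv = nn.Linear(dim, 2 * dim)          # key/value projections
        self.saliency = nn.Linear(self.head_dim, 1)   # per-head saliency gate (assumed form)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        kv = self.to_kv(x).reshape(B, N, 2, self.num_heads, self.head_dim)
        k, v = kv.unbind(dim=2)                        # each (B, N, H, Dh)
        # Saliency gate: one score per key token, softmax over the sequence.
        attn = self.saliency(k).softmax(dim=1)         # (B, N, H, 1)
        # Global context: weighted sum of values, O(N) instead of O(N^2).
        ctx = (attn * v).sum(dim=1, keepdim=True)      # (B, 1, H, Dh)
        # Broadcast the shared global context back to every position
        # (one of several plausible ways to mix it in).
        out = (v + ctx).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    # Example: a 14x14 feature map flattened to N = 196 tokens.
    x = torch.randn(2, 196, 256)
    print(KeyOnlyAttention(dim=256).forward(x).shape)  # torch.Size([2, 196, 256])
```

Because the gate produces one score per token rather than an N×N attention map, compute and memory grow as O(N·C) rather than O(N²·C), which is what makes attention affordable in the high-resolution early stages of a hierarchical backbone.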