Although image transformers have shown results competitive with convolutional neural networks on computer vision tasks, their lack of inductive biases such as locality still limits model efficiency, especially for embedded applications. In this work, we address this issue by introducing attention masks that incorporate spatial locality into self-attention heads. Local dependencies are captured efficiently by masked attention heads, while global dependencies are captured by unmasked attention heads. With the Masked attention image Transformer (MaiT), top-1 accuracy increases by up to 1.7% compared to CaiT with fewer parameters and FLOPs, and throughput improves by up to 1.5× compared to Swin. Encoding locality with attention masks is model agnostic, and thus it applies to monolithic, hierarchical, or other novel transformer architectures.
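As a rough illustration of the idea (not the paper's reference implementation), the sketch below applies a 2D-locality mask to a subset of attention heads while leaving the rest unmasked. The mask shape (a Chebyshev-distance window over the patch grid), the `window` parameter, and the helper names `local_attention_mask` and `masked_heads_attention` are assumptions for exposition only; the paper's exact mask pattern may differ.

```python
import torch

def local_attention_mask(grid: int, window: int) -> torch.Tensor:
    """Boolean (N, N) mask over N = grid*grid patch tokens: True where
    the key patch lies within a `window`-radius neighborhood of the
    query patch on the 2D grid (hypothetical locality pattern)."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid), torch.arange(grid), indexing="ij"), dim=-1)
    coords = coords.reshape(-1, 2)                    # (N, 2) patch coordinates
    # Chebyshev distance between every pair of patches
    dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(-1)
    return dist <= window                             # (N, N) boolean mask

def masked_heads_attention(q, k, v, mask, n_local_heads):
    """q, k, v: (B, H, N, d). The first `n_local_heads` heads attend only
    within `mask` (local dependencies); the remaining heads attend
    globally (unmasked), mirroring the local/global head split."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, H, N, N)
    neg_inf = torch.finfo(scores.dtype).min
    # Mask out-of-window positions for the local heads only
    scores[:, :n_local_heads] = scores[:, :n_local_heads].masked_fill(
        ~mask, neg_inf)
    return scores.softmax(dim=-1) @ v
```

Because the mask only modifies attention scores, this scheme adds no parameters and leaves the rest of the transformer block unchanged, which is consistent with the model-agnostic claim above.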