Self-attention is powerful in modeling long-range dependencies, but it is weak in local finer-level feature learning. The performance of local self-attention (LSA) is merely on par with convolution and inferior to dynamic filters, which puzzles researchers: should we use LSA or its counterparts, which is better, and what makes LSA mediocre? To clarify these questions, we comprehensively investigate LSA and its counterparts from two sides: \emph{channel setting} and \emph{spatial processing}. We find that the devil lies in the generation and application of spatial attention, where relative position embeddings and the neighboring filter application are key factors. Based on these findings, we propose the enhanced local self-attention (ELSA) with Hadamard attention and the ghost head. Hadamard attention introduces the Hadamard product to efficiently generate attention in the neighboring case, while maintaining high-order mapping. The ghost head combines attention maps with static matrices to increase channel capacity. Experiments demonstrate the effectiveness of ELSA. Without any architecture or hyperparameter modification, drop-in replacing LSA with ELSA boosts Swin Transformer \cite{swin} by up to +1.4 top-1 accuracy. ELSA also consistently benefits VOLO \cite{volo} from D1 to D5, where ELSA-VOLO-D5 achieves 87.2 top-1 accuracy on ImageNet-1K without extra training images. In addition, we evaluate ELSA on downstream tasks. ELSA significantly improves the baseline by up to +1.9 box AP / +1.3 mask AP on COCO, and by up to +1.9 mIoU on ADE20K. Code is available at \url{https://github.com/damo-cv/ELSA}.
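For intuition, the two components can be sketched in a few lines of PyTorch. The snippet below is an illustrative approximation rather than the released implementation at the repository above: the $1\times 1$-convolution attention generator, the additive ghost-head combination, the softmax normalization, and the names and hyperparameters (\texttt{ELSASketch}, \texttt{kernel\_size}, \texttt{gen\_heads}, \texttt{ghost\_mul}) are all assumptions made for illustration.

\begin{verbatim}
# Illustrative sketch only -- not the released implementation; the 1x1-conv
# attention generator, the additive ghost combination, the softmax
# normalization, and all names/hyperparameters below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ELSASketch(nn.Module):
    """Toy ELSA block: Hadamard attention + ghost head over a k x k window."""
    def __init__(self, dim, kernel_size=7, gen_heads=4, ghost_mul=2):
        super().__init__()
        assert dim % (gen_heads * ghost_mul) == 0
        self.k, self.gen_heads = kernel_size, gen_heads
        self.heads = gen_heads * ghost_mul
        self.head_dim = dim // self.heads
        self.q = nn.Conv2d(dim, dim, 1)
        self.kp = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        # Hadamard attention: map q*k (element-wise) to k*k logits per head
        self.attn_gen = nn.Conv2d(dim, gen_heads * kernel_size ** 2, 1)
        # ghost head: static matrices expanding gen_heads -> heads
        self.ghost = nn.Parameter(
            0.02 * torch.randn(ghost_mul, gen_heads, kernel_size ** 2))

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        q, k, v = self.q(x), self.kp(x), self.v(x)
        attn = self.attn_gen(q * k)             # Hadamard product, not dot product
        attn = attn.view(B, self.gen_heads, self.k ** 2, H, W)
        # combine generated attention with static matrices (additive toy choice)
        attn = attn.unsqueeze(1) + self.ghost.view(
            1, -1, self.gen_heads, self.k ** 2, 1, 1)
        attn = attn.reshape(B, self.heads, self.k ** 2, H, W).softmax(dim=2)
        # apply attention as a per-location neighboring filter on v
        v = F.unfold(v, self.k, padding=self.k // 2)      # (B, C*k*k, H*W)
        v = v.view(B, self.heads, self.head_dim, self.k ** 2, H, W)
        out = (attn.unsqueeze(2) * v).sum(dim=3)          # (B, heads, hd, H, W)
        return out.reshape(B, C, H, W)

# quick shape check
y = ELSASketch(dim=96)(torch.randn(2, 96, 14, 14))
assert y.shape == (2, 96, 14, 14)
\end{verbatim}

The \texttt{unfold}-and-weight step stands in for the neighboring filter application discussed above; the actual model may differ in how the attention map is normalized and in how the ghost head fuses the static matrices with the generated attention.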