Relative position encoding (RPE) is important for transformers to capture the sequence ordering of input tokens. Its general efficacy has been proven in natural language processing. However, in computer vision its efficacy is not well studied and even remains controversial, e.g., whether relative position encoding can work as well as absolute position encoding. To clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied to vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE). Our methods consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in the self-attention mechanism. The proposed iRPE methods are simple and lightweight, and can be easily plugged into transformer blocks. Experiments demonstrate that, solely due to the proposed encoding methods, DeiT and DETR obtain up to 1.5% (top-1 Acc) and 1.3% (mAP) stable improvements over their original versions on ImageNet and COCO respectively, without tuning any extra hyperparameters such as learning rate and weight decay. Our ablation and analysis also yield interesting findings, some of which run counter to previous understanding. Code and models are open-sourced at https://github.com/microsoft/Cream/tree/main/iRPE.
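To make the idea concrete, below is a minimal sketch (not the authors' iRPE implementation) of how a learnable, directional 2D relative position bias can be injected into self-attention scores for an image token grid. It assumes a square grid of tokens, clips relative offsets to a fixed range, and uses a simple per-head bias table; the paper's contextual variants additionally let the bias interact with the query vectors. All class, parameter, and buffer names here are illustrative.

```python
# Minimal sketch, assuming a square H x W token grid and a simple bias-style
# relative position term (the contextual query-interaction variants are omitted).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelPosSelfAttention(nn.Module):
    def __init__(self, dim, num_heads, grid_size, max_rel_dist=7):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

        # One learnable bias per (head, clipped dy, clipped dx) bucket,
        # so the encoding is directional: (+d) and (-d) get different values.
        num_buckets = 2 * max_rel_dist + 1
        self.rel_bias = nn.Parameter(torch.zeros(num_heads, num_buckets, num_buckets))

        # Precompute clipped 2D relative offsets between all token pairs.
        coords = torch.stack(torch.meshgrid(
            torch.arange(grid_size), torch.arange(grid_size), indexing="ij"), dim=-1)
        coords = coords.reshape(-1, 2)                    # (N, 2)
        rel = coords[:, None, :] - coords[None, :, :]     # (N, N, 2) signed offsets
        rel = rel.clamp(-max_rel_dist, max_rel_dist) + max_rel_dist
        self.register_buffer("rel_index", rel)            # indices into rel_bias

    def forward(self, x):                                 # x: (B, N, C), N = grid_size**2
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]                  # each: (B, heads, N, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale     # (B, heads, N, N)
        bias = self.rel_bias[:, self.rel_index[..., 0], self.rel_index[..., 1]]
        attn = attn + bias.unsqueeze(0)                   # add directional relative bias
        attn = F.softmax(attn, dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Such a module can replace the standard attention block in a vision transformer layer without changing any other training hyperparameters, which mirrors the plug-and-play usage described above.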