Siamese-based trackers have achieved excellent performance on visual object tracking. However, the target template is not updated online, and the features of the target template and search image are computed independently in a Siamese architecture. In this paper, we propose Deformable Siamese Attention Networks, referred to as SiamAttn, by introducing a new Siamese attention mechanism that computes deformable self-attention and cross-attention. The self-attention learns strong context information via spatial attention, and selectively emphasizes interdependent channel-wise features with channel attention. The cross-attention aggregates rich contextual inter-dependencies between the target template and the search image, providing an implicit way to adaptively update the target template. In addition, we design a region refinement module that computes depth-wise cross correlations between the attentional features for more accurate tracking. We conduct experiments on six benchmarks, where our method achieves new state-of-the-art results, outperforming the strong baseline, SiamRPN++ [24], by improving EAO from 0.464 to 0.537 on VOT 2016 and from 0.415 to 0.470 on VOT 2018. Our code is available at: https://github.com/msight-tech/research-siamattn.
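To make the depth-wise cross correlation mentioned in the abstract concrete, the following is a minimal pure-Python sketch, not the released SiamAttn implementation. It assumes plain nested lists for feature maps, no batch dimension, and no attention applied beforehand; the function name `depthwise_xcorr` and the toy inputs are hypothetical and chosen only for illustration. Each channel of the search feature is correlated with the same channel of the template feature, which acts as the kernel, giving one response map per channel.

```python
def depthwise_xcorr(search, template):
    """Correlate each channel of `search` with the same channel of `template`.

    search:   list of C channels, each an Hs x Ws list of lists
    template: list of C channels, each an Ht x Wt list of lists
              (assumed Ht <= Hs and Wt <= Ws)
    returns:  list of C response maps, each (Hs-Ht+1) x (Ws-Wt+1)
    """
    if len(search) != len(template):
        raise ValueError("search and template need the same channel count")
    responses = []
    for s_ch, t_ch in zip(search, template):
        hs, ws = len(s_ch), len(s_ch[0])
        ht, wt = len(t_ch), len(t_ch[0])
        out = []
        # Slide the template channel over the search channel (no padding).
        for i in range(hs - ht + 1):
            row = []
            for j in range(ws - wt + 1):
                acc = 0.0
                for u in range(ht):
                    for v in range(wt):
                        acc += s_ch[i + u][j + v] * t_ch[u][v]
                row.append(acc)
            out.append(row)
        responses.append(out)
    return responses


# Toy example (illustrative values only): 2 channels, 3x3 search, 2x2 template.
search = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],  # channel 0
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],  # channel 1
]
template = [
    [[1, 0], [0, 1]],  # channel 0
    [[1, 1], [1, 1]],  # channel 1
]
response = depthwise_xcorr(search, template)
print(response)
```

Because channels are never mixed, the output keeps the channel count of the inputs. In a deep learning framework the same operation is commonly expressed as a grouped convolution with one group per channel.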