RGB-T tracking uses images from both the visible and thermal modalities. The primary objective is to adaptively exploit whichever modality is dominant under varying conditions, thereby achieving more robust tracking than single-modality approaches. This paper proposes an RGB-T tracker based on a mixed attention mechanism that achieves complementary fusion of the modalities (referred to as MACFT). In the feature extraction stage, we use separate transformer backbone branches to extract modality-specific and modality-shared information. Mixed attention operations in the backbone enable information interaction and self-enhancement between the template and search images, constructing a robust feature representation that better captures the high-level semantics of the target. Then, in the feature fusion stage, a mixed attention-based modality fusion network performs modality-adaptive fusion, suppressing noise from the low-quality modality while enhancing information from the dominant one. Evaluation on multiple public RGB-T datasets demonstrates that the proposed tracker outperforms other RGB-T trackers on standard evaluation metrics while also adapting well to long-term tracking scenarios.
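The "mixed attention" idea described above (template and search tokens attending jointly, so cross-interaction and self-enhancement happen in one operation) can be illustrated with a minimal sketch. This is not the paper's actual implementation: the function name `mixed_attention`, the token counts, and the omission of learned query/key/value projections are all simplifications for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixed_attention(template, search):
    """Joint (mixed) attention sketch: template and search tokens are
    concatenated so every token attends to both sets at once, combining
    self-attention (self-enhancement) and cross-attention (interaction)
    in a single scaled dot-product step. Learned projections omitted."""
    tokens = np.concatenate([template, search], axis=0)  # (Nt+Ns, d)
    d = tokens.shape[-1]
    q = k = v = tokens  # identity projections, for brevity only
    attn = softmax(q @ k.T / np.sqrt(d))  # (Nt+Ns, Nt+Ns)
    out = attn @ v
    n_t = template.shape[0]
    return out[:n_t], out[n_t:]  # enhanced template / search tokens

rng = np.random.default_rng(0)
t = rng.standard_normal((4, 8))    # 4 template tokens, dim 8 (toy sizes)
s = rng.standard_normal((16, 8))   # 16 search tokens
t_enh, s_enh = mixed_attention(t, s)
print(t_enh.shape, s_enh.shape)    # (4, 8) (16, 8)
```

In a real backbone the tokens would pass through learned linear projections and multiple heads, but the key design choice shown here is the single attention map over the concatenated token set, which is what lets the template and search regions condition each other.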