Multi-modal tracking has gained attention because it can be more accurate and robust than traditional RGB-based tracking in complex scenarios. The key lies in how to fuse multi-modal data and reduce the gap between modalities. However, multi-modal tracking still suffers severely from data deficiency, which results in insufficient learning of fusion modules. Instead of building such a fusion module, in this paper we offer a new perspective on multi-modal tracking by attaching importance to multi-modal visual prompts. We design a novel multi-modal prompt tracker (ProTrack), which transfers the multi-modal inputs to a single modality via the prompt paradigm. By fully exploiting the tracking ability of pre-trained RGB trackers learned at scale, ProTrack achieves high-performance multi-modal tracking by only altering the inputs, even without any extra training on multi-modal data. Extensive experiments on 5 benchmark datasets demonstrate the effectiveness of the proposed ProTrack.
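To make the "only altering the inputs" idea concrete, below is a minimal sketch of such a visual prompt: the RGB frame and the auxiliary-modality frame (e.g., depth or thermal, replicated to three channels) are combined into one pseudo-RGB image that a frozen, pre-trained RGB tracker consumes unchanged. The convex-combination form, the weight `lam`, and the function name `multimodal_prompt` are illustrative assumptions for this sketch, not necessarily the paper's exact formulation.

```python
import numpy as np

def multimodal_prompt(rgb: np.ndarray, aux: np.ndarray, lam: float = 0.6) -> np.ndarray:
    """Fuse an RGB frame and an auxiliary-modality frame into a single
    pseudo-RGB prompt that a frozen, pre-trained RGB tracker can consume.

    rgb, aux: float arrays in [0, 1] with identical shape (H, W, 3).
    lam: fusion weight; an illustrative hyperparameter, not a value
         taken from the paper.
    """
    assert rgb.shape == aux.shape, "modalities must be aligned and same shape"
    # Pixel-wise convex combination: the tracker's input format is unchanged,
    # so no fusion module or retraining on multi-modal data is required.
    return lam * rgb + (1.0 - lam) * aux

# Example usage: fuse RGB with a depth map broadcast to three channels.
rgb = np.random.rand(256, 256, 3)
depth = np.random.rand(256, 256, 1).repeat(3, axis=2)
prompt = multimodal_prompt(rgb, depth)  # feed to any frozen RGB tracker
```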