Compared with previous two-stream trackers, the recent one-stream tracking pipeline, which allows earlier interaction between the template and the search region, has achieved a remarkable performance gain. However, existing one-stream trackers always let the template interact with all parts of the search region in every encoder layer. This can lead to target-background confusion when the extracted feature representations are not sufficiently discriminative. To alleviate this issue, we propose a generalized relation modeling method based on adaptive token division. The proposed method is a generalized formulation of attention-based relation modeling for Transformer tracking: it inherits the merits of both the two-stream and one-stream pipelines while enabling more flexible relation modeling by selecting appropriate search tokens to interact with template tokens. An attention masking strategy and the Gumbel-Softmax technique are introduced to facilitate parallel computation and end-to-end learning of the token division module. Extensive experiments show that our method is superior to both the two-stream and one-stream pipelines and achieves state-of-the-art performance on six challenging benchmarks at real-time running speed.
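As a rough illustration of how token division and attention masking can fit together, the following is a minimal sketch in plain Python, not the implementation behind the reported results. All function names are hypothetical. The masking rule is an assumption made for the example: attention between template tokens and unselected search tokens is blocked in both directions, and every other pair stays allowed. Only the forward pass of a hard Gumbel-Softmax sample is shown; the gradient path needed for end-to-end learning is omitted.

```python
import math
import random


def gumbel_softmax_hard(logits, tau, rng):
    """Forward pass of a hard Gumbel-Softmax sample: returns (one_hot, soft_probs)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)); keep U strictly inside (0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    winner = soft.index(max(soft))
    one_hot = [1 if i == winner else 0 for i in range(len(soft))]
    return one_hot, soft


def divide_search_tokens(token_logits, tau=1.0, rng=None):
    """token_logits holds one [select, skip] logit pair per search token."""
    rng = rng or random.Random()
    return [gumbel_softmax_hard(pair, tau, rng)[0][0] == 1 for pair in token_logits]


def build_attention_mask(num_template, selected):
    """allowed[q][k] is True when query token q may attend to key token k.

    Token order is [template tokens..., search tokens...].
    Assumed rule (illustrative only): template <-> unselected search is blocked.
    """
    n = num_template + len(selected)

    def is_template(i):
        return i < num_template

    def is_unselected(i):
        return i >= num_template and not selected[i - num_template]

    return [
        [
            not ((is_template(q) and is_unselected(k))
                 or (is_unselected(q) and is_template(k)))
            for k in range(n)
        ]
        for q in range(n)
    ]


def masked_softmax(scores, allowed):
    """Softmax over one attention row; blocked keys receive zero weight."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    peak = max(masked)  # finite, because every token may attend to itself
    exps = [math.exp(v - peak) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the division is expressed as a mask over one full attention matrix, all tokens can still be processed in a single batched attention call rather than being split into groups of varying size, which is the kind of parallel computation the masking strategy is meant to preserve.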