We present a Siamese-like dual-branch network for tracking based solely on Transformers. Given a template and a search image, we divide them into non-overlapping patches and extract a feature vector for each patch based on its matching results with the others within an attention window. For each token, we estimate whether it contains the target object and, if so, the corresponding size. The advantage of this approach is that the features are learned from matching and, ultimately, for matching, so the features are aligned with the object tracking task. The method achieves results better than or comparable to the best-performing methods, which first use a CNN to extract features and then use a Transformer to fuse them, and it outperforms the state-of-the-art methods on the GOT-10k and VOT2020 benchmarks. In addition, the method achieves real-time inference speed (about $40$ fps) on one GPU. The code and models will be released.
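To make the described pipeline concrete, the following is a minimal PyTorch sketch of one plausible reading of the abstract: a shared patch embedding for both branches, self-attention restricted to fixed-size windows for matching, and per-token heads for target presence and size. All module names, dimensions, and the simple concatenation-based fusion are hypothetical fillers for details the abstract does not specify; this is an illustration, not the released implementation.

```python
# Minimal sketch of the described tracking pipeline (hypothetical names and
# sizes; not the authors' released code).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project them to tokens."""
    def __init__(self, patch=16, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                      # x: (B, 3, H, W)
        t = self.proj(x)                       # (B, dim, H/p, W/p)
        return t.flatten(2).transpose(1, 2)    # (B, N, dim)

class WindowAttention(nn.Module):
    """Self-attention restricted to fixed-size windows of tokens, so each
    patch feature comes from matching against others in its window."""
    def __init__(self, dim=256, heads=8, window=64):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):                 # tokens: (B, N, dim), N % window == 0
        B, N, D = tokens.shape
        w = tokens.reshape(B * N // self.window, self.window, D)
        w, _ = self.attn(w, w, w)              # match each token within its window
        return w.reshape(B, N, D)

class TrackerHead(nn.Module):
    """Per-token target-presence classification and size regression."""
    def __init__(self, dim=256):
        super().__init__()
        self.cls = nn.Linear(dim, 1)           # does this token contain the target?
        self.size = nn.Linear(dim, 2)          # normalized (w, h) of the target box

    def forward(self, tokens):
        return self.cls(tokens).squeeze(-1), self.size(tokens).sigmoid()

# Toy forward pass: 128x128 template, 256x256 search image, shared weights
# in both branches (Siamese-like).
embed, attn, head = PatchEmbed(), WindowAttention(), TrackerHead()
z = embed(torch.randn(1, 3, 128, 128))         # template tokens: (1, 64, 256)
x = embed(torch.randn(1, 3, 256, 256))         # search tokens:   (1, 256, 256)
fused = attn(torch.cat([z, x], dim=1))         # joint windowed matching (assumed fusion)
scores, sizes = head(fused[:, z.shape[1]:])    # predict only on search-image tokens
```

Because the same attention operation both extracts and fuses features, no separate CNN backbone is needed, which is the property the abstract emphasizes: features are learned from matching and used for matching.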