Visual object tracking often employs a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the processes of feature extraction and target information integration, in this paper we present a compact tracking framework, termed MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme allows us to extract target-specific discriminative features and perform extensive communication between the target and the search area. Based on MAM, we build our MixFormer trackers simply by stacking multiple MAMs and placing a localization head on top. Specifically, we instantiate two types of MixFormer trackers: a hierarchical tracker, MixCvT, and a non-hierarchical tracker, MixViT. For these two trackers, we investigate a series of pre-training methods and uncover the different behaviors of supervised and self-supervised pre-training in our MixFormer trackers. We also extend masked pre-training to our MixFormer trackers and design a competitive TrackMAE pre-training technique. Finally, to handle multiple target templates during online tracking, we devise an asymmetric attention scheme in MAM to reduce computational cost and propose an effective score prediction module to select high-quality templates. Our MixFormer trackers set new state-of-the-art performance on seven tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10k, OTB100 and UAV123. In particular, our MixViT-L achieves an AUC score of 73.3% on LaSOT, 86.1% on TrackingNet, an EAO of 0.584 on VOT2020, and an AO of 75.7% on GOT-10k. Code and trained models will be made available at https://github.com/MCG-NJU/MixFormer.
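To make the mixed attention idea concrete, the following is a minimal PyTorch sketch of a mixed attention block that concatenates template and search tokens and attends over them jointly, with an asymmetric scheme in which template queries attend only to template keys while search queries attend to the full token set. All class and argument names (`MixedAttention`, `dim`, `num_heads`, tensor layouts) are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class MixedAttention(nn.Module):
    """Illustrative sketch of mixed attention over template + search tokens
    with an asymmetric attention scheme (an assumption-based sketch, not the
    official MixFormer code)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, template: torch.Tensor, search: torch.Tensor):
        # template: (B, Nt, C), search: (B, Ns, C)
        B, Nt, C = template.shape
        Ns = search.shape[1]
        x = torch.cat([template, search], dim=1)  # (B, Nt+Ns, C)
        qkv = self.qkv(x).reshape(B, Nt + Ns, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)      # each: (B, H, Nt+Ns, C/H)

        # Asymmetric scheme: template queries use only template keys/values,
        # which avoids recomputing template attention for multiple templates;
        # search queries attend to the full (template + search) key set.
        q_t, q_s = q[:, :, :Nt], q[:, :, Nt:]
        k_t, v_t = k[:, :, :Nt], v[:, :, :Nt]

        attn_t = (q_t @ k_t.transpose(-2, -1)) * self.scale
        out_t = attn_t.softmax(dim=-1) @ v_t      # (B, H, Nt, C/H)

        attn_s = (q_s @ k.transpose(-2, -1)) * self.scale
        out_s = attn_s.softmax(dim=-1) @ v        # (B, H, Ns, C/H)

        out = torch.cat([out_t, out_s], dim=2).transpose(1, 2).reshape(B, Nt + Ns, C)
        out = self.proj(out)
        return out[:, :Nt], out[:, Nt:]
```

A tracker in this spirit would stack several such blocks and place a localization head on the updated search tokens; the exact stage layout (hierarchical in MixCvT vs. non-hierarchical in MixViT) is described in the paper itself.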