Template-based discriminative trackers are currently the dominant tracking paradigm due to their robustness, but they are restricted to bounding-box tracking and a limited range of transformation models, which reduces their localization accuracy. We propose a discriminative single-shot segmentation tracker, D3S2, which narrows the gap between visual object tracking and video object segmentation. A single-shot network applies two target models with complementary geometric properties, one invariant to a broad range of transformations (including non-rigid deformations) and the other assuming a rigid object, to simultaneously achieve robust online target segmentation. The overall tracking reliability is further increased by decoupling the object and feature scale estimation. Without per-dataset fine-tuning, and trained only with segmentation as the primary output, D3S2 outperforms all published trackers on the recent short-term tracking benchmark VOT2020 and performs very close to state-of-the-art trackers on GOT-10k, TrackingNet, OTB100, and LaSOT. D3S2 outperforms the leading segmentation tracker SiamMask on video object segmentation benchmarks and performs on par with top video object segmentation algorithms.