The current push toward end-to-end trainable computer vision systems poses major challenges for the task of visual tracking. In contrast to most other vision problems, tracking requires learning a robust target-specific appearance model online, during the inference stage. To be end-to-end trainable, the online learning of the target model thus needs to be embedded in the tracking architecture itself. Due to these difficulties, the popular Siamese paradigm simply predicts a target feature template. However, such a model possesses limited discriminative power due to its inability to integrate background information. We develop an end-to-end tracking architecture capable of fully exploiting both target and background appearance information for target model prediction. Our architecture is derived from a discriminative learning loss by designing a dedicated optimization process that can predict a powerful model in only a few iterations. Furthermore, our approach is able to learn key aspects of the discriminative loss itself. The proposed tracker sets a new state of the art on six tracking benchmarks, achieving an EAO score of 0.440 on VOT2018 while running at over 40 FPS.
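To make the core idea concrete, the following is a minimal sketch of predicting a target model with only a few optimizer iterations on a discriminative loss. It is an illustrative simplification, not the paper's actual method: a linear model is fit by steepest descent (with exact step lengths for a quadratic loss) to separate target features from background features; all names (`predict_model`, `X`, `y`, `lam`, `num_iters`) are assumptions for this example.

```python
import numpy as np

def predict_model(X, y, lam=0.1, num_iters=5):
    """Hedged sketch: predict a linear target model w in a few
    steepest-descent steps on the discriminative least-squares loss
    L(w) = 0.5*||X w - y||^2 + 0.5*lam*||w||^2.
    Rows of X carry both target and background features, so the
    model can exploit background information (unlike a template)."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        residual = X @ w - y
        g = X.T @ residual + lam * w       # gradient of the loss
        Ag = X.T @ (X @ g) + lam * g       # Hessian-vector product
        denom = g @ Ag
        if denom < 1e-12:                  # already converged
            break
        alpha = (g @ g) / denom            # exact step for a quadratic loss
        w = w - alpha * g
    return w

# Toy data: 2-D features from a target cluster and a background cluster.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)),    # target features
               rng.normal(-2.0, 0.5, (20, 2))])  # background features
y = np.concatenate([np.ones(20), np.zeros(20)])  # desired responses

w = predict_model(X, y)
scores = X @ w  # target samples should score higher than background
```

In the paper's setting, the analogous optimization runs inside the network and is itself trained end to end, so the loss and the optimizer's internals are learned rather than fixed as above.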