While deep-learning-based methods for visual tracking have achieved substantial progress, they rely on large-scale, high-quality annotated data for sufficient training. To eliminate expensive and exhaustive annotation, we study self-supervised learning for visual tracking. In this work, we develop the Crop-Transform-Paste operation, which synthesizes sufficient training data by simulating various kinds of scene variations during tracking, including appearance variations of the object and background changes. Since the object state is known in all synthesized data, existing deep trackers can be trained in routine ways without human annotation. Unlike typical self-supervised learning methods that perform visual representation learning as a separate step, the proposed self-supervised learning mechanism can be seamlessly integrated into any existing tracking framework for training. Extensive experiments show that our method 1) achieves more favorable performance than supervised learning in few-shot tracking scenarios; 2) handles various tracking challenges such as object deformation, occlusion, and background clutter by design; and 3) can be combined with supervised learning to further boost performance, which is particularly effective in few-shot tracking scenarios.
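To make the Crop-Transform-Paste idea concrete, the following is a minimal sketch of how such a synthesis step could look. It assumes frames are HxWx3 uint8 NumPy arrays, boxes are (x, y, w, h) in pixels, and the background image is larger than the transformed patch; the function name, parameters, and the specific transforms (scale, flip, brightness jitter) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import cv2


def crop_transform_paste(frame, box, background, rng=None):
    """Synthesize a training image with a known object state (hypothetical sketch)."""
    rng = rng or np.random.default_rng()
    x, y, w, h = box

    # Crop: extract the target patch from the source frame.
    patch = frame[y:y + h, x:x + w].copy()

    # Transform: simulate appearance variations of the object.
    scale = rng.uniform(0.8, 1.2)                                  # scale change
    new_w, new_h = max(1, int(w * scale)), max(1, int(h * scale))
    patch = cv2.resize(patch, (new_w, new_h))
    if rng.random() < 0.5:                                         # horizontal flip
        patch = patch[:, ::-1]
    patch = np.clip(patch * rng.uniform(0.7, 1.3), 0, 255).astype(np.uint8)  # brightness jitter

    # Paste: place the transformed patch at a random location on a background
    # image, simulating background changes (assumes background is large enough).
    bg = background.copy()
    bh, bw = bg.shape[:2]
    px = int(rng.integers(0, bw - new_w))
    py = int(rng.integers(0, bh - new_h))
    bg[py:py + new_h, px:px + new_w] = patch

    # The pasted location is the ground-truth box, so no human annotation is needed.
    return bg, (px, py, new_w, new_h)
```

Because the synthesized box is known by construction, the output pair can be fed to any existing tracker's training loop in place of manually annotated data.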