Deep learning based visual trackers require offline pre-training on large volumes of video data with accurate bounding box annotations, which are labor-intensive to obtain. We present a new framework to facilitate bounding box annotation for video sequences, which adopts a selection-and-refinement strategy to automatically improve the preliminary annotations generated by tracking algorithms. We propose a temporal assessment network (T-Assess Net) that captures the temporal coherence of target locations and selects reliable tracking results by measuring their quality. We further design a visual-geometry refinement network (VG-Refine Net) that enhances the selected tracking results by considering both target appearance and temporal geometry constraints, allowing inaccurate tracking results to be corrected. Together, the two networks provide a principled approach to ensuring the quality of automatic video annotation. Experiments on large-scale tracking benchmarks demonstrate that our method delivers highly accurate bounding box annotations and reduces human labor by 94.0%, offering an effective means to further boost tracking performance with augmented training data.
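To make the selection-and-refinement pipeline concrete, the following is a minimal sketch of how the two networks could cooperate on a single sequence, assuming a PyTorch setting. All names here (TAssessNet, annotate_video, the LSTM-based assessor, the refine_net interface, the threshold tau) are hypothetical illustrations, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TAssessNet(nn.Module):
    """Hypothetical assessor: scores temporal coherence of tracker boxes."""
    def __init__(self, box_dim=4, hidden=64):
        super().__init__()
        # An LSTM over per-frame box coordinates is one plausible way to
        # model the temporal coherence of target locations (assumption).
        self.rnn = nn.LSTM(box_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, boxes):  # boxes: (B, T, 4) as (x, y, w, h)
        feats, _ = self.rnn(boxes)
        return torch.sigmoid(self.head(feats)).squeeze(-1)  # (B, T) quality scores


def annotate_video(boxes, frames, assess_net, refine_net, tau=0.5):
    """Select reliable tracker boxes; send the rest to the refiner.

    boxes:  (T, 4) preliminary tracker outputs for one sequence
    frames: sequence of T video frames
    refine_net(frame, boxes, t) -> (4,) corrected box (interface assumed)
    """
    with torch.no_grad():
        quality = assess_net(boxes.unsqueeze(0)).squeeze(0)  # (T,)
    refined = boxes.clone()
    for t in (quality < tau).nonzero(as_tuple=True)[0]:
        # Low-quality frames get corrected using target appearance and the
        # temporal geometry implied by the neighboring boxes.
        refined[t] = refine_net(frames[t], boxes, t)
    return refined, quality
```

In this sketch, frames whose assessed quality exceeds the threshold keep their tracker-generated boxes, while the remainder are passed to the refiner; the 94.0% labor reduction reported above corresponds to the fraction of frames that never need manual annotation.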