Tracking by natural language specification is an emerging research topic that aims to locate a target object in a video sequence based on its language description. Compared with traditional bounding box (BBox) based tracking, this setting guides object tracking with high-level semantic information, addresses the ambiguity of BBox, and organically links local and global search. These benefits may yield more flexible, robust, and accurate tracking performance in practical scenarios. However, existing natural language initialized trackers are developed and compared on benchmark datasets proposed for tracking-by-BBox, which cannot reflect the true power of tracking-by-language. In this work, we propose a new benchmark specifically dedicated to tracking-by-language, including a large-scale dataset and strong, diverse baseline methods. Specifically, we collect 2k video sequences (containing a total of 1,244,340 frames and 663 words) and split them into 1,300 for training and 700 for testing. For each video, we densely annotate one sentence in English and the corresponding bounding boxes of the target object. We also introduce two new challenges into TNL2K for the object tracking task, i.e., adversarial samples and modality switch. A strong baseline method based on an adaptive local-global-search scheme is proposed for future work to compare against. We believe this benchmark will greatly boost related research on natural language guided tracking.