In this paper, we propose a new long-video dataset (called Track Long and Prosper - TLP) and benchmark for visual object tracking. The dataset consists of 50 videos from real-world scenarios, spanning a duration of over 400 minutes (676K frames), making it more than 20-fold larger in average duration per sequence and more than 8-fold larger in total covered duration than existing generic datasets for visual tracking. The proposed dataset paves the way to suitably assess long-term tracking performance and possibly to train better deep learning architectures (avoiding or reducing augmentation, which may not reflect realistic real-world behavior). We benchmark the dataset on 17 state-of-the-art trackers and rank them according to tracking accuracy and run-time speed. We further categorize the test sequences by different attributes and present a thorough quantitative and qualitative evaluation. Our most interesting observations are (a) existing short-sequence benchmarks fail to bring out the inherent differences among tracking algorithms, which widen when tracking on long sequences, and (b) the accuracy of most trackers drops abruptly on challenging long sequences, suggesting the potential need for research efforts in the direction of long-term tracking.