Generic Object Tracking (GOT) is the problem of tracking target objects specified by bounding boxes in the first frame of a video. While the task has received much attention over the past decades, researchers have focused almost exclusively on the single-object setting. Multi-object GOT has wider applicability, making it more attractive for real-world applications. We attribute the lack of research interest in this problem to the absence of suitable benchmarks. In this work, we introduce a new large-scale GOT benchmark, LaGOT, containing multiple annotated target objects per sequence. Our benchmark allows users to tackle key remaining challenges in GOT, aiming to increase robustness and reduce computation by tracking multiple objects jointly. In addition, we propose a transformer-based GOT tracker baseline that processes multiple objects jointly through shared computation. Our approach runs 4x faster when tracking 10 objects concurrently than when tracking each object independently, and it outperforms existing single-object trackers on our new benchmark. Our approach also achieves highly competitive results on single-object GOT datasets, setting a new state of the art on TrackingNet with a success rate AUC of 84.4%. Our benchmark, code, and trained models will be made publicly available.