The ability to recognize, localize, and track dynamic objects in a scene is fundamental to many real-world applications, such as self-driving and robotic systems. Yet, traditional multiple object tracking (MOT) benchmarks rely on only a few object categories that hardly represent the multitude of possible objects encountered in the real world. This leaves contemporary MOT methods limited to a small set of pre-defined object categories. In this paper, we address this limitation by tackling a novel task, open-vocabulary MOT, that aims to evaluate tracking beyond pre-defined training categories. We further develop OVTrack, an open-vocabulary tracker that is capable of tracking arbitrary object classes. Its design is based on two key ingredients: first, leveraging vision-language models for both classification and association via knowledge distillation; second, a data hallucination strategy for robust appearance feature learning from denoising diffusion probabilistic models. The result is an extremely data-efficient open-vocabulary tracker that sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark, while being trained solely on static images. Project page: https://www.vis.xyz/pub/ovtrack/
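The first ingredient, classification via a vision-language model, typically works by embedding class names with a text encoder and scoring each detected region's embedding against them by cosine similarity. The following is a minimal sketch of that matching step only; it uses random vectors as stand-ins for CLIP region and text embeddings, and all names, dimensions, and prompts are illustrative, not the paper's implementation:

```python
import numpy as np

def cosine_sim(a, b):
    # Row-wise cosine similarity between two sets of embeddings.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def classify_regions(region_embs, text_embs, temperature=0.01):
    # Open-vocabulary classification: softmax over similarities
    # between region embeddings and per-class text embeddings.
    logits = cosine_sim(region_embs, text_embs) / temperature
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
dim = 512  # illustrative embedding size
class_prompts = ["a photo of a dog",
                 "a photo of a skateboard",
                 "a photo of a unicycle"]
# Stand-ins for text-encoder outputs; a real system would use a
# frozen vision-language model such as CLIP here.
text_embs = rng.normal(size=(len(class_prompts), dim))
# Simulate a region whose embedding lies near the second class.
region_embs = text_embs[1:2] + 0.1 * rng.normal(size=(1, dim))

probs = classify_regions(region_embs, text_embs)
print(class_prompts[int(probs.argmax())])  # highest-scoring class name
```

Because the class list is just a set of text prompts, new categories can be added at inference time without retraining, which is what makes the vocabulary "open".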