Scaling object taxonomies is an important step toward robust real-world deployment of recognition systems. Image recognition has seen remarkable progress since the introduction of the LVIS benchmark. To extend this success to videos, a new video benchmark, TAO, was recently presented. Given the recent encouraging results from both the detection and tracking communities, we are interested in marrying those two advances and building a strong large-vocabulary video tracker. However, supervision in LVIS and TAO is inherently sparse or even missing, posing two new challenges for training large-vocabulary trackers. First, LVIS provides no tracking supervision, which leads to inconsistent learning of detection (with LVIS and TAO) and tracking (with TAO only). Second, the detection supervision in TAO is partial, which results in catastrophic forgetting of the absent LVIS categories during video fine-tuning. To resolve these challenges, we present a simple but effective learning framework that takes full advantage of all available training data to learn detection and tracking without losing the ability to recognize any LVIS category. With this new learning scheme, we show consistent improvements across various large-vocabulary trackers, setting strong baseline results on the challenging TAO benchmarks.