Recently, one-stage trackers that use a joint model to predict both detections and appearance embeddings in a single forward pass have received much attention and achieved state-of-the-art results on Multi-Object Tracking (MOT) benchmarks. However, their success depends on the availability of videos fully annotated with tracking data, which is expensive and hard to obtain, and this can limit model generalization. In comparison, the two-stage approach, which performs detection and embedding separately, is slower but easier to train, as its data are easier to annotate. We propose to combine the best of the two worlds through a data distillation approach. Specifically, we use a teacher embedder, trained on Re-ID datasets, to generate pseudo appearance-embedding labels for the detection datasets. Then, we use the augmented dataset to train a detector that is also capable of regressing these pseudo-embeddings in a fully convolutional fashion. Our proposed one-stage solution matches its two-stage counterpart in quality but is 3 times faster. Even though the teacher embedder has not seen any tracking data during training, our proposed tracker achieves performance competitive with popular trackers (e.g., JDE) trained on fully labeled tracking data.
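The distillation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `teacher_embed` is a hypothetical stand-in for a Re-ID teacher network (here, a fixed random projection), and `pseudo_label` shows how each ground-truth detection box in a detection-only dataset would be augmented with a teacher-generated appearance embedding before training the one-stage student.

```python
import numpy as np

def teacher_embed(crop, dim=128):
    # Hypothetical stand-in for a Re-ID teacher network:
    # mean-pool the crop's pixels, apply a fixed random projection,
    # and L2-normalize, as Re-ID embeddings typically are.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((crop.shape[-1], dim))
    v = crop.reshape(-1, crop.shape[-1]).mean(axis=0) @ w
    return v / np.linalg.norm(v)

def pseudo_label(image, boxes):
    """Attach a teacher embedding to each detection box (x1, y1, x2, y2).

    The returned records form the 'augmented dataset': detection labels
    plus pseudo appearance-embedding labels for the student to regress.
    """
    records = []
    for (x1, y1, x2, y2) in boxes:
        crop = image[y1:y2, x1:x2]
        records.append({"box": (x1, y1, x2, y2),
                        "embedding": teacher_embed(crop)})
    return records

# Toy example: one image from a detection dataset with two annotated boxes.
image = np.random.rand(64, 64, 3)
boxes = [(0, 0, 32, 32), (16, 16, 48, 48)]
augmented = pseudo_label(image, boxes)
```

In the full method, the student detector would then be trained to output these embeddings densely (one per location) alongside its detection heads, so that a single forward pass yields both boxes and appearance features.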