Tracking by detection, the dominant approach for online multi-object tracking, alternates between localization and association steps. As a result, it depends strongly on the quality of instantaneous observations and often fails when objects are not fully visible. In contrast, tracking in humans is underpinned by the notion of object permanence: once an object is recognized, we are aware of its physical existence and can approximately localize it even under full occlusion. In this work, we introduce an end-to-end trainable approach for joint object detection and tracking that is capable of such reasoning. We build on the recent CenterTrack architecture, which takes pairs of frames as input, and extend it to videos of arbitrary length. To this end, we augment the model with a spatio-temporal, recurrent memory module, allowing it to reason about object locations and identities in the current frame using the entire preceding history. It is, however, not obvious how to train such an approach. We study this question on a new, large-scale, synthetic dataset for multi-object tracking, which provides ground-truth annotations for invisible objects, and propose several approaches for supervising tracking behind occlusions. Our model, trained jointly on synthetic and real data, outperforms the state of the art on the KITTI and MOT17 datasets thanks to its robustness to occlusions.
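To make the contrast between per-frame association and object permanence concrete, the following is a minimal toy sketch, not the model described above. It replaces the learned spatio-temporal memory with a hand-written constant-velocity state per track, and the names `Track`, `step`, and `max_dist` are hypothetical, introduced only for illustration. The point it shows is that a track carried in a persistent state can still be localized approximately in frames where no detection is available.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Track:
    """Per-object state carried across frames (a stand-in for learned memory)."""
    track_id: int
    x: float
    y: float
    vx: float = 0.0
    vy: float = 0.0
    visible: bool = True


def step(tracks, detections, max_dist=2.0):
    """Advance the toy tracker by one frame.

    tracks: list of Track objects carried over from earlier frames.
    detections: list of (x, y) object centers observed in the current frame.
    max_dist: assumed gating radius for matching a detection to a prediction.
    """
    unmatched = list(detections)
    for t in tracks:
        # Predict the current location from the stored state alone.
        px, py = t.x + t.vx, t.y + t.vy
        # Greedy nearest-neighbour association against the prediction.
        best = min(unmatched, key=lambda d: hypot(d[0] - px, d[1] - py), default=None)
        if best is not None and hypot(best[0] - px, best[1] - py) <= max_dist:
            unmatched.remove(best)
            t.vx, t.vy = best[0] - t.x, best[1] - t.y
            t.x, t.y = best
            t.visible = True
        else:
            # No observation (e.g. full occlusion): keep the object alive
            # at its predicted location instead of dropping the track.
            t.x, t.y = px, py
            t.visible = False
    # Every detection that matched no existing track starts a new one.
    next_id = max((t.track_id for t in tracks), default=-1) + 1
    for i, (x, y) in enumerate(unmatched):
        tracks.append(Track(next_id + i, x, y))
    return tracks


# One object moving right by one unit per frame, undetected in frames 2 and 3.
frames = [[(0.0, 0.0)], [(1.0, 0.0)], [], [], [(4.0, 0.0)]]
tracks = []
for detections in frames:
    tracks = step(tracks, detections)
print(tracks)
```

In this toy run the object keeps a single identity through the two frames without detections, because its position is extrapolated from the stored state. A tracker that relies only on instantaneous observations would have nothing to associate in those frames. The approach in the abstract learns this behaviour end to end rather than hard-coding a motion model.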