Tracking by detection, the dominant approach for online multi-object tracking, alternates between localization and re-identification steps. As a result, it strongly depends on the quality of instantaneous observations, often failing when objects are not fully visible. In contrast, tracking in humans is underpinned by the notion of object permanence: once an object is recognized, we are aware of its physical existence and can approximately localize it even under full occlusions. In this work, we introduce an end-to-end trainable approach for joint object detection and tracking that is capable of such reasoning. We build on top of the recent CenterTrack architecture, which takes pairs of frames as input, and extend it to videos of arbitrary length. To this end, we augment the model with a spatio-temporal, recurrent memory module, allowing it to reason about object locations and identities in the current frame using the entire previous history. It is, however, not obvious how to train such an approach. We study this question on a new, large-scale, synthetic dataset for multi-object tracking, which provides ground-truth annotations for invisible objects, and propose several approaches for supervising tracking behind occlusions. Our model, trained jointly on synthetic and real data, outperforms the state of the art on the KITTI and MOT17 datasets thanks to its robustness to occlusions.
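To make the recurrent memory idea concrete, the following is a minimal, hypothetical sketch of a spatio-temporal recurrent memory update in the spirit described above: a GRU-style cell operating on per-frame feature maps, so the hidden state is itself a spatial map that can accumulate evidence about objects across frames, including frames where an object is occluded. The class name, 1x1-convolution parameterization, and all shapes are illustrative assumptions, not the paper's actual module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ConvGRUMemory:
    """Hypothetical sketch of a spatio-temporal recurrent memory:
    a GRU-style update where each gate is a 1x1 convolution
    (i.e., a per-pixel linear map), so the hidden state stays a
    spatial feature map aligned with the input frame."""

    def __init__(self, in_ch, hid_ch, seed=0):
        rng = np.random.default_rng(seed)
        # 1x1 conv weights for update gate z, reset gate r, candidate state
        self.Wz = rng.normal(0.0, 0.1, (hid_ch, in_ch + hid_ch))
        self.Wr = rng.normal(0.0, 0.1, (hid_ch, in_ch + hid_ch))
        self.Wh = rng.normal(0.0, 0.1, (hid_ch, in_ch + hid_ch))

    def _conv1x1(self, W, x):
        # x: (C, H, W); apply the same linear map at every spatial location
        return np.einsum('oc,chw->ohw', W, x)

    def step(self, feat, h):
        # feat: (in_ch, H, W) current-frame features; h: (hid_ch, H, W) memory
        xh = np.concatenate([feat, h], axis=0)
        z = sigmoid(self._conv1x1(self.Wz, xh))        # update gate
        r = sigmoid(self._conv1x1(self.Wr, xh))        # reset gate
        xrh = np.concatenate([feat, r * h], axis=0)
        h_cand = np.tanh(self._conv1x1(self.Wh, xrh))  # candidate memory
        return (1.0 - z) * h + z * h_cand              # blended new memory

# Roll the memory over a short sequence of (illustrative) frame features.
mem = ConvGRUMemory(in_ch=4, hid_ch=8)
h = np.zeros((8, 16, 16))  # initial empty memory
for t in range(5):
    feat = np.random.default_rng(t).normal(size=(4, 16, 16))
    h = mem.step(feat, h)
print(h.shape)
```

Because the gates blend the previous state with a bounded candidate, the memory can carry a localized object hypothesis forward through frames in which the detector sees nothing, which is the mechanism the occlusion supervision would have to shape during training.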