For a long time, the most common paradigm in multi-object tracking (MOT) was tracking-by-detection (TbD), where objects are first detected in each frame and then associated across frames. For association, most models resorted to motion and appearance cues, e.g., re-identification (re-ID) networks. Recent attention-based approaches instead learn these cues in a data-driven manner, showing impressive results. In this paper, we ask whether simple, good old TbD methods can also reach the performance of end-to-end models. To this end, we propose two key ingredients that allow a standard re-identification network to excel at appearance-based tracking. We extensively analyse its failure cases and show that combining our appearance features with a simple motion model leads to strong tracking results. Our tracker generalizes across four public datasets, namely MOT17, MOT20, BDD100k, and DanceTrack, achieving state-of-the-art performance. We will release the code and models.
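To make the appearance-plus-motion association concrete, here is a minimal, illustrative sketch (not the paper's exact formulation): each track and detection carries a re-ID embedding and a bounding box; an appearance cost (cosine distance between embeddings) is fused with a motion cost (1 - IoU between boxes) and matched greedily. The function names, the fusion weight `alpha`, and the gating threshold `max_cost` are hypothetical choices for this sketch.

```python
# Illustrative tracking-by-detection association: fuse an appearance cost
# (cosine distance between re-ID embeddings) with a motion cost (1 - IoU
# between boxes), then match greedily. All names/weights are hypothetical.
import math

def cosine_dist(a, b):
    # 1 - cosine similarity; 0 for identical directions, up to 2 for opposite.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns intersection-over-union in [0, 1].
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def associate(tracks, detections, alpha=0.5, max_cost=0.7):
    """Greedily match tracks to detections on a fused cost.

    tracks / detections: lists of dicts with 'emb' (feature vector) and
    'box' (x1, y1, x2, y2). Returns a list of (track_idx, det_idx) pairs.
    """
    costs = []
    for ti, t in enumerate(tracks):
        for di, d in enumerate(detections):
            app = cosine_dist(t["emb"], d["emb"])   # appearance cue
            mot = 1.0 - iou(t["box"], d["box"])     # motion/overlap cue
            costs.append((alpha * app + (1 - alpha) * mot, ti, di))
    costs.sort()  # cheapest pairs first
    matches, used_t, used_d = [], set(), set()
    for c, ti, di in costs:
        if c > max_cost:  # gate out implausible pairs
            break
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti); used_d.add(di)
    return matches
```

In practice, TbD trackers typically replace the greedy loop with optimal bipartite matching (e.g., the Hungarian algorithm) and the raw IoU with a motion-model prediction, but the fused-cost structure is the same.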