Most modern multiple object tracking (MOT) systems follow the tracking-by-detection paradigm, consisting of a detector followed by a method for associating detections into tracks. There is a long history in tracking of combining motion and appearance features to provide robustness to occlusions and other challenges, but typically this comes with the trade-off of a more complex and slower implementation. Recent successes on popular 2D tracking benchmarks indicate that top scores can be achieved using a state-of-the-art detector and relatively simple associations relying on single-frame spatial offsets -- notably outperforming contemporary methods that leverage learned appearance features to help re-identify lost tracks. In this paper, we propose an efficient joint detection and tracking model named DEFT, or "Detection Embeddings for Tracking." Our approach relies on an appearance-based object matching network jointly learned with an underlying object detection network. An LSTM is also added to capture motion constraints. DEFT has comparable accuracy and speed to the top methods on 2D online tracking leaderboards while having significant advantages in robustness when applied to more challenging tracking data. DEFT raises the bar on the nuScenes monocular 3D tracking challenge, more than doubling the performance of the previous top method. Code is publicly available.