Multi-object tracking (MOT) in videos remains challenging due to complex object motions and crowded scenes. Recent DETR-based frameworks offer end-to-end solutions but typically process detection and track queries jointly within a single Transformer decoder layer, leading to conflicts between the two query types and degraded association accuracy. We introduce the Motion-Aware Transformer (MATR), which explicitly predicts object motion across frames and uses it to update track queries before decoding. By reducing collisions between detection and track queries, MATR enables more consistent training and improves both detection and association. Extensive experiments on DanceTrack, SportsMOT, and BDD100k show that MATR delivers significant gains across standard metrics. On DanceTrack, MATR improves HOTA by more than 9 points over MOTR without additional training data and reaches a new state-of-the-art score of 71.3 HOTA with supplementary data. MATR also achieves state-of-the-art results on SportsMOT (72.2 HOTA) and BDD100k (54.7 mTETA, 41.6 mHOTA) without relying on external datasets. These results demonstrate that explicitly modeling motion within end-to-end Transformers is a simple yet highly effective way to advance multi-object tracking.
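To make the core mechanism concrete, the following is a minimal PyTorch sketch of the idea described above: a small head predicts inter-frame motion for each track query and shifts its reference box before the decoder runs. All names here (`MotionAwareQueryUpdate`, `motion_head`) and the (dx, dy, dw, dh) offset parameterization are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MotionAwareQueryUpdate(nn.Module):
    """Hypothetical sketch: predict per-track motion offsets and shift
    track-query reference boxes before the Transformer decoder runs."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Small MLP that regresses a (dx, dy, dw, dh) offset per track query.
        self.motion_head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, 4),
        )

    def forward(self, track_queries: torch.Tensor,
                prev_boxes: torch.Tensor) -> torch.Tensor:
        """track_queries: (N, D) embeddings carried over from the last frame.
        prev_boxes: (N, 4) normalized (cx, cy, w, h) boxes from the last frame.
        Returns motion-updated boxes used as decoder reference points."""
        offsets = self.motion_head(track_queries)  # (N, 4)
        # Shift last-frame boxes by the predicted inter-frame motion,
        # clamping to keep the references inside the image.
        return (prev_boxes + offsets).clamp(0.0, 1.0)


# Usage sketch: pre-align track queries before decoding the current frame.
update = MotionAwareQueryUpdate(embed_dim=256)
queries = torch.randn(5, 256)   # 5 active tracks
boxes = torch.rand(5, 4)        # their last-frame boxes
refs = update(queries, boxes)   # motion-updated reference boxes
```

Under this reading, pre-aligning track queries with their likely new positions reduces overlap with detection queries competing for the same objects, which is the query-collision effect the abstract attributes the association gains to.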