The challenging task of multi-object tracking (MOT) requires simultaneous reasoning about track initialization, identity, and spatio-temporal trajectories. We formulate this task as a frame-to-frame set prediction problem and introduce TrackFormer, an end-to-end trainable MOT approach based on an encoder-decoder Transformer architecture. Our model achieves data association between frames via attention by evolving a set of track predictions through a video sequence. The Transformer decoder initializes new tracks from static object queries and autoregressively follows existing tracks in space and time with the conceptually new, identity-preserving track queries. Both query types benefit from self- and encoder-decoder attention over global frame-level features, thereby omitting any additional graph optimization or modeling of motion and/or appearance. TrackFormer introduces a new tracking-by-attention paradigm and, while simple in its design, achieves state-of-the-art performance on multi-object tracking (MOT17 and MOT20) and segmentation (MOTS20). The code is available at https://github.com/timmeinhardt/trackformer.
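The frame-to-frame query bookkeeping described above can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: `track_video`, `toy_decoder`, and the score threshold are assumptions, and the decoder here is a stand-in for the actual Transformer decoder, which would run self- and encoder-decoder attention over frame features and emit a score, box, and output embedding per query.

```python
# Hypothetical sketch of TrackFormer's tracking-by-attention loop:
# static object queries spawn new identities, and each kept output
# embedding becomes an identity-preserving track query for the next frame.

SCORE_THRESH = 0.5  # assumed cutoff for keeping a detection or track

def track_video(frames, num_object_queries, decoder):
    """Evolve a set of identity-preserving track queries through a video."""
    track_queries = []  # (track_id, embedding) pairs carried between frames
    next_id = 0
    results = []
    for feats in frames:
        # The decoder jointly processes existing track queries and a fixed
        # set of static object queries (candidates for new tracks).
        outputs = decoder(feats, track_queries, [None] * num_object_queries)
        kept, frame_out = [], []
        for src, (score, box, emb) in outputs:
            if score < SCORE_THRESH:
                continue  # low score: track ends, or no new object detected
            if src == "object":
                tid, next_id = next_id, next_id + 1  # start a new identity
            else:
                tid = src  # existing track keeps its identity
            frame_out.append((tid, box))
            kept.append((tid, emb))  # output embedding is next frame's query
        track_queries = kept
        results.append(frame_out)
    return results

def toy_decoder(feats, track_queries, object_queries):
    """Fake decoder: keeps all existing tracks, spawns one track in frame 0."""
    outs = [(tid, (0.9, feats, emb)) for tid, emb in track_queries]
    for i, _ in enumerate(object_queries):
        score = 0.8 if feats == 0 and i == 0 else 0.1
        outs.append(("object", (score, feats, "emb")))
    return outs

tracks = track_video(frames=[0, 1, 2], num_object_queries=2,
                     decoder=toy_decoder)
print(tracks)  # the object found in frame 0 keeps identity 0 across frames
```

The key design point this sketch captures is that no explicit motion or appearance model is needed: identity is carried purely by re-feeding each track's output embedding as a query on the next frame.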