The challenging task of multi-object tracking (MOT) requires simultaneous reasoning about track initialization, identity, and spatiotemporal trajectories. We formulate this task as a frame-to-frame set prediction problem and introduce TrackFormer, an end-to-end MOT approach based on an encoder-decoder Transformer architecture. Our model achieves data association between frames via attention by evolving a set of track predictions through a video sequence. The Transformer decoder initializes new tracks from static object queries and autoregressively follows existing tracks in space and time with the new concept of identity-preserving track queries. Both decoder query types benefit from self- and encoder-decoder attention over global frame-level features, thereby obviating any additional graph optimization, heuristic matching, or explicit modeling of motion and appearance. TrackFormer represents a new tracking-by-attention paradigm and yields state-of-the-art performance on the tasks of multi-object tracking (MOT17) and segmentation (MOTS20). The code is available at https://github.com/timmeinhardt/trackformer.
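To make the two query types concrete, the following is a minimal, purely illustrative sketch of the track-query lifecycle the abstract describes: static object queries spawn new identities, while track queries carry existing identities from one frame to the next. The class and method names (`Track`, `TrackQueryBookkeeper`, `step`) are hypothetical and not part of the TrackFormer codebase; the actual model performs this association implicitly through Transformer attention rather than explicit bookkeeping.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)

@dataclass
class Track:
    identity: int  # preserved across frames by the track query
    box: Box       # most recent box prediction for this identity

class TrackQueryBookkeeper:
    """Hypothetical helper sketching frame-to-frame identity preservation.

    In TrackFormer itself, each surviving track corresponds to a track query
    fed back into the decoder; here we only model the resulting id handling.
    """

    def __init__(self) -> None:
        self.next_id = 0
        self.tracks: Dict[int, Track] = {}

    def step(self,
             track_detections: Dict[int, Box],
             new_detections: List[Box]) -> Dict[int, Track]:
        # Track queries: identities that the decoder still predicts an object
        # for survive; absent identities are implicitly terminated.
        survivors = {i: Track(i, box) for i, box in track_detections.items()}
        # Object queries: each remaining detection initializes a new track
        # with a fresh identity.
        for box in new_detections:
            survivors[self.next_id] = Track(self.next_id, box)
            self.next_id += 1
        self.tracks = survivors
        return self.tracks
```

For example, two detections in the first frame create identities 0 and 1; if only identity 0 is re-detected in the second frame alongside one new object, the result is identities 0 and 2, with identity 1 terminated.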