We present TrackFormer, an end-to-end multi-object tracking and segmentation model based on an encoder-decoder Transformer architecture. Our approach introduces track query embeddings which follow objects through a video sequence in an autoregressive fashion. New track queries are spawned by the DETR object detector and embed the position of their corresponding object over time. The Transformer decoder adjusts track query embeddings from frame to frame, thereby following the changing object positions. TrackFormer achieves seamless data association between frames in a new tracking-by-attention paradigm via self- and encoder-decoder attention mechanisms that simultaneously reason about location, occlusion, and object identity. TrackFormer yields state-of-the-art performance on the tasks of multi-object tracking (MOT17) and segmentation (MOTS20). We hope our unified way of performing detection and tracking will foster future research in multi-object tracking and video understanding. Code will be made publicly available.
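The mechanism the abstract describes, persistent track query embeddings decoded jointly with DETR-style object queries frame after frame, can be illustrated with a short sketch. This is a minimal, hypothetical PyTorch rendering, not the released implementation: the names (`TrackQueryLoop`, `score_head`, `keep_threshold`) and the simple score-based survival rule for queries are assumptions, and the actual model additionally handles bipartite matching during training, track re-identification, and attention masking.

```python
import torch
import torch.nn as nn

class TrackQueryLoop(nn.Module):
    """Minimal sketch of a tracking-by-attention loop (hypothetical names)."""

    def __init__(self, d_model=256, num_object_queries=100, keep_threshold=0.5):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # Learned object queries spawn new tracks, as in DETR.
        self.object_queries = nn.Parameter(torch.randn(num_object_queries, d_model))
        self.score_head = nn.Linear(d_model, 1)  # objectness score per query
        self.keep_threshold = keep_threshold     # assumed survival rule

    def forward(self, frame_features):
        """frame_features: list of (1, HW, d_model) encoder outputs, one per frame."""
        d_model = frame_features[0].size(-1)
        track_queries = torch.zeros(1, 0, d_model)  # no tracks before frame 1
        outputs = []
        for memory in frame_features:
            # Decode track queries (existing identities) jointly with object
            # queries (potential new identities); self-attention among them
            # performs the data association implicitly.
            queries = torch.cat(
                [track_queries, self.object_queries.unsqueeze(0)], dim=1)
            hs = self.decoder(queries, memory)                  # (1, N, d_model)
            scores = self.score_head(hs).sigmoid().squeeze(-1)  # (1, N)
            keep = scores > self.keep_threshold
            # Surviving embeddings become the next frame's track queries:
            # identity is carried by the embedding itself, autoregressively.
            track_queries = hs[keep].unsqueeze(0).detach()
            outputs.append(hs)
        return outputs
```

The design point the sketch preserves is that identities are never matched explicitly between frames: a surviving query embedding is the track, and joint decoding of track and object queries lets attention suppress duplicate detections of objects that are already being followed.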