In this paper, we present a new tracking architecture with an encoder-decoder transformer as the key component. The encoder models the global spatio-temporal feature dependencies between target objects and search regions, while the decoder learns a query embedding to predict the spatial positions of the target objects. Our method casts object tracking as a direct bounding box prediction problem, without using any proposals or predefined anchors. With the encoder-decoder transformer, object prediction uses only a simple fully-convolutional network that directly estimates the corners of objects. The whole method is end-to-end and does not need any postprocessing steps such as cosine windowing or bounding box smoothing, thus largely simplifying existing tracking pipelines. The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks while running at real-time speed, 6x faster than Siam R-CNN. Code and models are open-sourced at https://github.com/researchmm/Stark.
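The pipeline described above can be illustrated with a short sketch. The following is a minimal PyTorch sketch, not the authors' exact STARK implementation (see the linked repository for that): backbone features of the template and search region are flattened into one token sequence for the encoder, a single learned query is decoded into a target embedding, and a fully-convolutional corner head reads the box off the reweighted search features via soft-argmax. The feature sizes, the learned positional embedding, the target-embedding reweighting, and the `CornerHead` design are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class CornerHead(nn.Module):
    """Fully-convolutional corner head: predicts top-left / bottom-right
    heatmaps over the search region and reads coordinates via soft-argmax.
    An illustrative design, not the paper's exact head."""
    def __init__(self, dim: int = 256, feat_size: int = 20):
        super().__init__()
        self.conv_tl = nn.Sequential(
            nn.Conv2d(dim, dim // 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // 2, 1, 1))
        self.conv_br = nn.Sequential(
            nn.Conv2d(dim, dim // 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // 2, 1, 1))
        # Normalized coordinate grids in row-major order for soft-argmax.
        coords = torch.linspace(0.0, 1.0, feat_size)
        self.register_buffer("grid_x", coords.repeat(feat_size))
        self.register_buffer("grid_y", coords.repeat_interleave(feat_size))

    def _soft_argmax(self, heatmap: torch.Tensor):
        prob = heatmap.flatten(1).softmax(dim=1)       # (B, H*W)
        return (prob * self.grid_x).sum(1), (prob * self.grid_y).sum(1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:  # (B, C, H, W)
        x1, y1 = self._soft_argmax(self.conv_tl(feat).squeeze(1))
        x2, y2 = self._soft_argmax(self.conv_br(feat).squeeze(1))
        return torch.stack([x1, y1, x2, y2], dim=1)    # normalized box

class TransformerTracker(nn.Module):
    def __init__(self, dim=256, heads=8, layers=6,
                 template_size=8, search_size=20):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        dec = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.decoder = nn.TransformerDecoder(dec, layers)
        n_tokens = template_size ** 2 + search_size ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))  # learned pos. enc.
        self.query = nn.Embedding(1, dim)  # single learned target query
        self.head = CornerHead(dim, search_size)
        self.search_size = search_size

    def forward(self, template_feat, search_feat):
        # template_feat: (B, C, Ht, Wt); search_feat: (B, C, Hs, Ws) --
        # backbone feature maps whose sizes must match the constructor args.
        B, C = search_feat.shape[:2]
        tokens = torch.cat(
            [template_feat.flatten(2), search_feat.flatten(2)], dim=2)
        tokens = tokens.permute(0, 2, 1) + self.pos        # (B, Nt+Ns, C)
        # Encoder jointly models template/search dependencies.
        memory = self.encoder(tokens)
        query = self.query.weight.unsqueeze(0).expand(B, -1, -1)
        target = self.decoder(query, memory)               # (B, 1, C)
        # Reweight encoded search tokens by the decoded target embedding,
        # then regress the corners fully convolutionally.
        search_mem = memory[:, -self.search_size ** 2:, :]
        feat = (search_mem * target).permute(0, 2, 1)
        feat = feat.reshape(B, C, self.search_size, self.search_size)
        return self.head(feat)                             # (B, 4)

if __name__ == "__main__":
    tracker = TransformerTracker()
    z = torch.randn(2, 256, 8, 8)    # template features
    x = torch.randn(2, 256, 20, 20)  # search-region features
    print(tracker(z, x).shape)       # torch.Size([2, 4])
```

Because the corners come from a differentiable soft-argmax rather than a discrete peak, the whole box regression stays end-to-end trainable, which is consistent with the abstract's claim of needing no postprocessing such as cosine windowing or box smoothing.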