Existing Visual Object Tracking (VOT) methods take only the target region in the first frame as a template. In fast-changing and crowded scenes this inevitably leads to tracking failure, because a single static template cannot account for changes in object appearance across frames. To this end, we revamp the tracking framework with the Progressive Context Encoding Transformer Tracker (ProContEXT), which coherently exploits spatial and temporal contexts to predict object motion trajectories. Specifically, ProContEXT leverages a context-aware self-attention module to encode spatial and temporal context, refining and updating multi-scale static and dynamic templates to perform progressively more accurate tracking. It explores the complementarity between spatial and temporal context, opening a new pathway to multi-context modeling for transformer-based trackers. In addition, ProContEXT revises the token pruning technique to reduce computational complexity. Extensive experiments on popular benchmark datasets such as GOT-10k and TrackingNet demonstrate that the proposed ProContEXT achieves state-of-the-art performance.
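To make the context-aware self-attention concrete, below is a minimal PyTorch sketch of the joint template-search attention that such a module builds on. The class and tensor names (ContextAwareAttention, static_templates, dynamic_templates, search) are illustrative assumptions, not the authors' API: the abstract only states that multi-scale static and dynamic templates are encoded together with the search region by self-attention.

```python
import torch
import torch.nn as nn

class ContextAwareAttention(nn.Module):
    """Sketch: jointly attend over static templates, dynamic templates, and search tokens."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, static_templates, dynamic_templates, search):
        # static_templates:  (B, Ns, C) tokens from the first-frame target crop(s)
        # dynamic_templates: (B, Nd, C) tokens from templates refreshed during tracking
        # search:            (B, Nx, C) tokens from the current search region
        tokens = torch.cat([static_templates, dynamic_templates, search], dim=1)
        # Joint self-attention lets search tokens draw on spatial context (static
        # templates) and temporal context (dynamic templates) in a single pass.
        out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + out)
        # Only the search-region tokens would feed a downstream prediction head.
        return tokens[:, -search.shape[1]:, :]

# Toy usage with hypothetical token counts.
B, C = 2, 256
module = ContextAwareAttention(dim=C)
out = module(torch.randn(B, 64, C), torch.randn(B, 128, C), torch.randn(B, 256, C))
print(out.shape)  # torch.Size([2, 256, 256])
```

Updating the dynamic templates over time is what makes the encoding "progressive": as tracked appearance drifts, the temporal context tokens are refreshed while the static first-frame template anchors the target identity.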
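The abstract does not specify the revised token pruning criterion. A minimal sketch of one standard scheme in transformer tracking, ranking search-region tokens by the attention they receive from template tokens and keeping only the top fraction, is shown below; the function and parameter names (prune_search_tokens, keep_ratio) are hypothetical, not ProContEXT's exact method.

```python
import torch

def prune_search_tokens(attn_weights, search_tokens, keep_ratio=0.7, num_template_tokens=64):
    """Drop low-relevance search tokens to cut attention cost in later layers.

    attn_weights:  (B, N, N) head-averaged self-attention over all N tokens,
                   with the first num_template_tokens rows/cols being templates.
    search_tokens: (B, N - num_template_tokens, C) search-region tokens.
    """
    # Relevance of each search token = mean attention paid to it by template tokens.
    scores = attn_weights[:, :num_template_tokens, num_template_tokens:].mean(dim=1)  # (B, Nx)
    k = max(1, int(scores.shape[1] * keep_ratio))
    # Keep the top-k tokens; re-sort indices so spatial order is preserved.
    keep_idx = scores.topk(k, dim=1).indices.sort(dim=1).values                       # (B, k)
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, search_tokens.shape[-1])
    return torch.gather(search_tokens, 1, idx), keep_idx

# Toy usage with hypothetical shapes.
B, N, Nt, C = 2, 320, 64, 256
attn = torch.softmax(torch.randn(B, N, N), dim=-1)
search = torch.randn(B, N - Nt, C)
kept, idx = prune_search_tokens(attn, search, keep_ratio=0.7, num_template_tokens=Nt)
print(kept.shape)  # torch.Size([2, 179, 256])
```

Because background tokens far from the target contribute little to localization, discarding them shrinks the token sequence quadratically in attention cost while leaving the retained template-target interactions intact.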