Existing Visual Object Tracking (VOT) methods take only the target area in the first frame as a template. Tracking therefore inevitably fails in fast-changing and crowded scenes, because a single static template cannot account for changes in object appearance across frames. To this end, we revamp the tracking framework with the Progressive Context Encoding Transformer Tracker (ProContEXT), which coherently exploits spatial and temporal contexts to predict object motion trajectories. Specifically, ProContEXT leverages a context-aware self-attention module to encode spatial and temporal context, refining and updating multi-scale static and dynamic templates to perform accurate tracking progressively. It explores the complementarity between spatial and temporal context, opening a new pathway to multi-context modeling for transformer-based trackers. In addition, ProContEXT revisits the token pruning technique to reduce computational complexity. Extensive experiments on popular benchmark datasets such as GOT-10k and TrackingNet demonstrate that the proposed ProContEXT achieves state-of-the-art performance.
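To make the two mechanisms named above concrete, the following is a minimal, hypothetical PyTorch sketch of (a) joint self-attention over static-template, dynamic-template, and search-region tokens, and (b) attention-guided pruning of the least relevant search tokens. All names (`ContextAttention`, `keep_ratio`) and design details here are illustrative assumptions, not ProContEXT's actual implementation.

```python
# Hypothetical sketch: joint context attention + attention-based token
# pruning. Names and parameters are assumptions for illustration only.
import torch
import torch.nn as nn


class ContextAttention(nn.Module):
    """Self-attention over the concatenation of static-template,
    dynamic-template, and search-region tokens, followed by pruning
    of the search tokens that receive the least template attention."""

    def __init__(self, dim: int, num_heads: int = 8, keep_ratio: float = 0.7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.keep_ratio = keep_ratio  # fraction of search tokens to keep

    def forward(self, static_t, dynamic_t, search):
        # static_t:  (B, Ns, C) tokens from the first-frame template
        # dynamic_t: (B, Nd, C) tokens from progressively updated templates
        # search:    (B, Nx, C) tokens from the current search region
        tokens = torch.cat([static_t, dynamic_t, search], dim=1)
        out, weights = self.attn(tokens, tokens, tokens,
                                 need_weights=True,
                                 average_attn_weights=True)

        n_ctx = static_t.size(1) + dynamic_t.size(1)
        static_out, dynamic_out, search_out = out.split(
            [static_t.size(1), dynamic_t.size(1), search.size(1)], dim=1)

        # Score each search token by the attention the context (template)
        # tokens pay to it, then keep only the top-k search tokens.
        scores = weights[:, :n_ctx, n_ctx:].mean(dim=1)  # (B, Nx)
        k = max(1, int(self.keep_ratio * search_out.size(1)))
        idx = scores.topk(k, dim=1).indices              # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, search_out.size(-1))
        pruned_search = search_out.gather(1, idx)        # (B, k, C)
        return static_out, dynamic_out, pruned_search


if __name__ == "__main__":
    blk = ContextAttention(dim=256)
    s = torch.randn(2, 64, 256)    # static template tokens
    d = torch.randn(2, 128, 256)   # dynamic template tokens
    x = torch.randn(2, 256, 256)   # search-region tokens
    print([t.shape for t in blk(s, d, x)])  # search pruned to 179 tokens
```

In a full tracker, the dynamic-template tokens would be refreshed from high-confidence predictions as tracking proceeds, which is what makes the context encoding progressive; the pruning step then discards background search tokens before later, more expensive stages.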