Higher Performance Visual Tracking with Dual-Modal Localization

Visual object tracking (VOT) demands robustness and accuracy at the same time. Most existing trackers fail to deliver both, so in this work we investigate the conflict between accuracy and robustness. We first conduct a systematic comparison of existing methods and analyze their limitations in terms of accuracy and robustness. Specifically, we consider four formulations, categorized by whether the model is updated online and by the type of supervision signal: offline classification (OFC), offline regression (OFR), online classification (ONC), and online regression (ONR). To address the conflict, we draw on the idea of ensembling and propose a dual-modal framework for target localization, consisting of robust localization, which suppresses distractors via ONR, and accurate localization, which attends precisely to the target center via OFC. To yield the final representation (i.e., the bounding box), we propose a simple but effective score voting strategy that involves adjacent predictions, so that the final representation does not commit to a single location. Running faster than real time, the proposed method is further validated on eight datasets (VOT2018, VOT2019, OTB2015, NFS, UAV123, LaSOT, TrackingNet, and GOT-10k), achieving state-of-the-art performance.
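The abstract does not spell out the voting rule, so the following is only a minimal sketch of one plausible reading of "score voting": candidate boxes that overlap the top-scoring box vote on the final box, weighted by their scores. The function names `iou` and `score_voting`, the IoU gate, and the `iou_threshold` default are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def score_voting(
    boxes: Sequence[Box],
    scores: Sequence[float],
    iou_threshold: float = 0.5,  # hypothetical gate, not from the paper
) -> Box:
    """Score-weighted average of the boxes adjacent to the best candidate."""
    if not boxes or len(boxes) != len(scores):
        raise ValueError("boxes and scores must be non-empty and equal length")

    # The top-scoring candidate anchors the vote.
    best = max(range(len(scores)), key=lambda i: scores[i])

    # "Adjacent" predictions are assumed to be those overlapping the anchor.
    voters = [
        i for i in range(len(boxes))
        if iou(boxes[best], boxes[i]) >= iou_threshold
    ]

    total = sum(scores[i] for i in voters)
    if total <= 0:
        return boxes[best]  # fall back to the single best box

    # Average each coordinate, weighted by classification score.
    x1, y1, x2, y2 = (
        sum(scores[i] * boxes[i][k] for i in voters) / total
        for k in range(4)
    )
    return (x1, y1, x2, y2)


if __name__ == "__main__":
    candidates = [(0, 0, 10, 10), (2, 0, 12, 10), (100, 100, 110, 110)]
    confidences = [0.6, 0.4, 0.3]
    print(score_voting(candidates, confidences))
```

In this sketch the distant third candidate has no overlap with the anchor and is excluded, so the result is pulled only toward the nearby second box. Gating on overlap is one way to keep a distractor from shifting the box; the paper's own strategy may differ.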
