Existing visual object trackers usually learn a bounding-box based template to match the target across frames. Such a template cannot capture a pixel-wise representation, which limits the tracker's ability to handle severe appearance variations. To address this issue, much effort has gone into segmentation-based tracking, which learns a pixel-wise, object-aware template and can achieve higher accuracy than tracking based on bounding-box templates. However, existing segmentation-based trackers learn the spatio-temporal correspondence across frames ineffectively because they do not exploit the rich temporal information. To overcome this limitation, this paper presents a novel segmentation-based tracking architecture equipped with a spatio-appearance memory network that learns accurate spatio-temporal correspondence. Within this network, an appearance memory network explores spatio-temporal non-local similarity to learn the dense correspondence between the segmentation mask and the current frame. Meanwhile, a spatial memory network is modeled as a discriminative correlation filter to learn the mapping between the feature map and the spatial map. The appearance memory network helps to filter out noisy samples in the spatial memory network, while the latter provides the former with a more accurate geometric center of the target. This mutual promotion greatly boosts tracking performance. Without bells and whistles, our simple yet effective tracking architecture sets a new state of the art on the VOT2016, VOT2018, VOT2019, GOT-10K, TrackingNet, and VOT2020 benchmarks. In addition, our tracker outperforms the leading segmentation-based trackers SiamMask and D3S by a large margin on two video object segmentation benchmarks, DAVIS16 and DAVIS17. The source code is available at https://github.com/phiphiphi31/DMB.
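To make the idea of a non-local memory read concrete, the following is a minimal sketch of similarity-weighted retrieval from stored frames. It is not the authors' implementation: the abstract does not specify the similarity function or the feature layout, so the dot-product similarity, the softmax normalization, and all names (`nonlocal_memory_read`, `memory_keys`, `memory_values`) are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nonlocal_memory_read(query_keys, memory_keys, memory_values):
    """Soft-read memory values for every query location (hypothetical helper).

    query_keys:    Q feature vectors, one per pixel of the current frame
    memory_keys:   M feature vectors, one per pixel of all stored frames
    memory_values: M value vectors (for example, mask-aware features)
    Returns Q vectors, each a similarity-weighted mix of the memory values.
    """
    dim = len(memory_values[0])
    read = []
    for q in query_keys:
        # Assumed similarity: dot product with every memory location,
        # regardless of which stored frame it came from (non-local in
        # both space and time).
        scores = [sum(a * b for a, b in zip(q, k)) for k in memory_keys]
        weights = softmax(scores)
        read.append(
            [sum(w * v[d] for w, v in zip(weights, memory_values))
             for d in range(dim)]
        )
    return read


# Toy data: two memory pixels, one foreground (value 1) and one background
# (value 0), and two query pixels that each resemble one of them.
memory_keys = [[10.0, 0.0], [0.0, 10.0]]
memory_values = [[1.0], [0.0]]
query_keys = [[1.0, 0.0], [0.0, 1.0]]
result = nonlocal_memory_read(query_keys, memory_keys, memory_values)
```

In this toy setting, the first query pixel reads a value close to 1 (foreground) and the second a value close to 0 (background), which is the sense in which a dense correspondence transfers mask information from the memory to the current frame.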