Robust object tracking requires knowledge of the tracked objects' appearance and motion, and of how both evolve over time. Although motion provides distinctive, complementary information, especially for fast-moving objects, most recent tracking architectures focus primarily on appearance. In this paper, we propose a two-stream deep neural network tracker that uses both spatial and temporal features. Our architecture builds on the ATOM tracker and contains two backbones: (i) a 2D-CNN that captures appearance features and (ii) a 3D-CNN that captures motion features. The features returned by the two networks are then fused by an attention-based Feature Aggregation Module (FAM). Because the whole architecture is unified, it can be trained end-to-end. Experimental results show that the proposed tracker, TRAT (TRacking by ATtention), achieves state-of-the-art performance on most of the benchmarks and significantly outperforms the baseline ATOM tracker.
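To make the fusion step concrete, the following is a minimal sketch of one way two feature streams could be combined by attention. It is not the FAM itself: the abstract does not specify the module's internals, so the per-channel softmax weighting, the function names `softmax` and `fuse_streams`, and the toy feature values are all assumptions chosen for illustration. A real module would operate on feature maps and learn its attention parameters during end-to-end training.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(appearance, motion):
    """Fuse two equal-length channel descriptors with per-channel attention.

    Hypothetical stand-in for a feature aggregation step: for each channel,
    the appearance and motion activations compete through a softmax, and the
    fused value is their attention-weighted sum. No parameters are learned.
    """
    if len(appearance) != len(motion):
        raise ValueError("both streams must have the same number of channels")
    fused, weights = [], []
    for a, m in zip(appearance, motion):
        w_a, w_m = softmax([a, m])
        weights.append((w_a, w_m))
        fused.append(w_a * a + w_m * m)
    return fused, weights


# Toy channel descriptors (assumed values), e.g. after global average pooling
# of the 2D-CNN (appearance) and 3D-CNN (motion) outputs.
appearance_feat = [0.2, 1.5, 0.0]
motion_feat = [0.2, 0.1, 2.0]
fused, weights = fuse_streams(appearance_feat, motion_feat)
```

In this toy setting, a channel where the motion stream responds strongly is dominated by the motion weight, while a channel with equal responses is split evenly. This mirrors the intuition that motion cues should matter most where they are most informative.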