TDIOT: Target-driven Inference for Deep Video Object Tracking

Recent tracking-by-detection approaches use deep object detectors as the target detection baseline because of their high performance on still images. For effective video object tracking, object detection is integrated with a data association step performed by either a custom-designed inference architecture or end-to-end joint training for tracking. In this work, we adopt the former approach and use the pre-trained Mask R-CNN deep object detector as the baseline. We introduce a novel inference architecture placed on top of the FPN-ResNet101 backbone of Mask R-CNN to jointly perform detection and tracking, without requiring additional training for tracking. The proposed single object tracker, TDIOT, applies appearance similarity-based temporal matching for data association. To tackle tracking discontinuities, we incorporate a local search and matching module into the inference head layer that exploits SiamFC for short-term tracking. Moreover, to improve robustness to scale changes, we introduce a scale-adaptive region proposal network that searches for the target in an adaptively enlarged spatial neighborhood specified by the trace of the target. To meet long-term tracking requirements, a low-cost verification layer is incorporated into the inference architecture to monitor the presence of the target based on its LBP histogram model. Performance evaluation on videos from the VOT2016, VOT2018 and VOT-LT2018 datasets demonstrates that TDIOT achieves higher accuracy than state-of-the-art short-term trackers while providing comparable performance in long-term tracking.
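
The abstract describes the verification layer only at a high level. As a rough illustration of how an LBP histogram model could be used to monitor target presence, the sketch below assumes a basic 8-neighbour LBP operator, histogram intersection as the similarity measure, and an arbitrary threshold of 0.7. None of these choices is stated in the abstract, and the names `lbp_histogram` and `target_present` are hypothetical, so this should not be read as the authors' implementation.

```python
import numpy as np


def lbp_histogram(gray):
    """Normalized 256-bin histogram of basic 8-neighbour LBP codes.

    `gray` is a 2-D grayscale patch of at least 3x3 pixels.
    (Assumption: the abstract does not specify the LBP variant.)
    """
    gray = np.asarray(gray, dtype=np.float64)
    h, w = gray.shape
    center = gray[1:-1, 1:-1]
    # Offsets of the 8 neighbours, clockwise starting at the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit wherever the neighbour is at least as bright as the center.
        codes |= (neighbour >= center).astype(np.uint8) << np.uint8(bit)
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()


def target_present(model_hist, candidate_patch, threshold=0.7):
    """Compare a candidate patch against the stored target model.

    Uses histogram intersection (assumed measure); the threshold is
    illustrative only. Returns (is_present, similarity).
    """
    candidate_hist = lbp_histogram(candidate_patch)
    similarity = float(np.minimum(model_hist, candidate_hist).sum())
    return similarity >= threshold, similarity


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target_patch = rng.integers(0, 256, size=(32, 32))
    model = lbp_histogram(target_patch)        # built once from the initial target
    print(target_present(model, target_patch)) # same patch, similarity near 1
```

In a tracker, the model histogram would be computed from the initial target region, and each new candidate region would be checked against it; a low similarity would signal that the target may have left the scene.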
