Recent work has shown that convolutional networks substantially improve multiple object tracking by jointly learning detection and appearance features. However, because convolutions have a local receptive field, long-range dependencies in both the spatial and temporal domains cannot be captured efficiently. To incorporate the spatial layout, we propose a local correlation module that models the topological relationship between targets and their surrounding environment, which enhances the discriminative power of our model in crowded scenes. Specifically, we establish dense correspondences between each spatial location and its context, and explicitly constrain the correlation volumes through self-supervised learning. To exploit the temporal context, existing approaches generally use two or more adjacent frames to construct an enhanced feature representation, but dynamic motion is inherently difficult to capture with CNNs. Instead, we propose a learnable correlation operator that establishes frame-to-frame matches over convolutional feature maps at different layers to align and propagate temporal context. Extensive experiments on the MOT datasets demonstrate the effectiveness of correlation learning: our approach achieves state-of-the-art performance, with a MOTA of 76.5% and an IDF1 of 73.6% on MOT17.
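To make the notion of a correlation volume concrete, the following is a minimal sketch of a dense local correlation between two feature maps. It is not the learnable operator proposed here: the function name `local_correlation`, the `radius` parameter, the zero treatment of out-of-bounds neighbours, and the normalization by channel count are all assumptions made for illustration. Plain nested lists are used for readability; a practical implementation would use tensor operations.

```python
def local_correlation(feat_q, feat_k, radius=1):
    """Correlate each location of feat_q with its neighbourhood in feat_k.

    feat_q, feat_k: feature maps as nested lists of shape [C][H][W].
    Returns a correlation volume of shape [(2*radius+1)**2][H][W], where
    plane index (dy + radius) * (2*radius + 1) + (dx + radius) holds the
    channel-averaged dot product between feat_q at (y, x) and feat_k at
    (y + dy, x + dx). Out-of-bounds neighbours contribute 0 (assumption).
    """
    channels = len(feat_q)
    height = len(feat_q[0])
    width = len(feat_q[0][0])

    volume = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            plane = [[0.0] * width for _ in range(height)]
            for y in range(height):
                for x in range(width):
                    ky, kx = y + dy, x + dx
                    if 0 <= ky < height and 0 <= kx < width:
                        dot = sum(
                            feat_q[c][y][x] * feat_k[c][ky][kx]
                            for c in range(channels)
                        )
                        # Dividing by the channel count is an assumed
                        # normalization, not taken from the abstract.
                        plane[y][x] = dot / channels
            volume.append(plane)
    return volume


# Toy feature map with C=2 channels, H=2 rows, W=3 columns.
frame_prev = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[0.5, 0.0, -1.0], [2.0, 1.0, 0.0]],
]

# Spatial case: correlate a frame with itself to relate each
# location to its surrounding context.
spatial_volume = local_correlation(frame_prev, frame_prev, radius=1)

# Temporal case: the "current" frame is the previous one shifted right
# by one pixel (zero-filled), so each location matches offset dx = -1.
frame_cur = [[[0.0] + row[:-1] for row in channel] for channel in frame_prev]
temporal_volume = local_correlation(frame_cur, frame_prev, radius=1)
```

Passing the same feature map twice gives a spatial correlation of each location with its context, while passing feature maps from two frames gives frame-to-frame correlations. In both cases the search is restricted to a small window, so the cost grows with `(2 * radius + 1) ** 2` rather than with the full image size.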