Autonomous systems need to localize and track surrounding objects in 3D space for safe motion planning. As a result, 3D multi-object tracking (MOT) plays a vital role in autonomous navigation. Most MOT methods use a tracking-by-detection pipeline, which includes object detection and data association processing. However, many approaches detect objects in 2D RGB sequences for tracking, which lacks reliability when localizing objects in 3D space. Furthermore, it remains challenging to learn discriminative features for temporally consistent detection across frames, and the affinity matrix is normally learned from independent object features without considering the feature interaction between objects detected in different frames. To address these problems, we first employ a joint feature extractor to fuse the 2D and 3D appearance features captured from 2D RGB images and 3D point clouds respectively, and then propose a novel convolutional operation, named RelationConv, to better exploit the correlation between each pair of objects in adjacent frames and learn a deep affinity matrix for further data association. Finally, we provide an extensive evaluation showing that our proposed model achieves state-of-the-art performance on the KITTI tracking benchmark.
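To make the pairwise-affinity idea concrete, the following is a minimal PyTorch sketch, not the paper's actual RelationConv: it scores every (frame t, frame t+1) object pair by concatenating their fused appearance features on an N-by-M grid and passing the grid through a shared 1x1-convolution head, producing an affinity matrix for data association. All names here (PairwiseAffinity, feat_dim, the two-layer head) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PairwiseAffinity(nn.Module):
    """Toy sketch: score every (track, detection) feature pair with a
    shared 1x1-conv head applied to concatenated pair features.
    This is an assumed stand-in for the paper's RelationConv."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # 1x1 convolutions act as a per-pair MLP over the (N, M) grid
        self.relation = nn.Sequential(
            nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, 1, kernel_size=1),
        )

    def forward(self, feats_t: torch.Tensor, feats_t1: torch.Tensor) -> torch.Tensor:
        # feats_t: (N, D) fused 2D+3D features of objects in frame t
        # feats_t1: (M, D) fused features of objects in frame t+1
        n, d = feats_t.shape
        m, _ = feats_t1.shape
        # Build an (N, M) grid where cell (i, j) holds [feat_i ; feat_j]
        a = feats_t.unsqueeze(1).expand(n, m, d)
        b = feats_t1.unsqueeze(0).expand(n, m, d)
        pairs = torch.cat([a, b], dim=-1)            # (N, M, 2D)
        pairs = pairs.permute(2, 0, 1).unsqueeze(0)  # (1, 2D, N, M)
        # (1, 1, N, M) -> (N, M) affinity logits for data association
        return self.relation(pairs).squeeze(0).squeeze(0)

# Usage: 4 objects in frame t, 5 in frame t+1, 128-d fused features
aff = PairwiseAffinity(feat_dim=128)(torch.randn(4, 128), torch.randn(5, 128))
print(aff.shape)  # torch.Size([4, 5])
```

Because the head sees both features of a pair jointly, each affinity score can depend on their interaction rather than being computed from each object's feature in isolation, which is the motivation the abstract gives for RelationConv.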