Recent research in multi-task learning reveals the benefit of solving related problems in a single neural network. 3D object detection and multi-object tracking (MOT) are two heavily intertwined problems: one predicts the location of an object instance, and the other associates that instance across time. However, most previous works in 3D MOT treat the detector as a separate preceding stage and feed its output to the tracker without any joint training. In this work, we present Minkowski Tracker, a sparse spatio-temporal R-CNN that jointly solves object detection and tracking. Inspired by region-based CNNs (R-CNN), we propose to solve tracking as a second stage of the object detector, one that predicts the probability of assigning each detection to each track. First, Minkowski Tracker takes 4D point clouds as input and generates a spatio-temporal bird's-eye-view (BEV) feature map through a 4D sparse convolutional encoder network. Then, our proposed TrackAlign aggregates the track region-of-interest (ROI) features from the BEV features. Finally, Minkowski Tracker updates each track and its confidence score based on the detection-to-track match probability predicted from the ROI features. We show in large-scale experiments that the overall performance gain of our method is due to four factors: (1) the temporal reasoning of the 4D encoder improves detection performance; (2) multi-task learning of object detection and MOT lets the two tasks enhance each other; (3) the detection-to-track match score learns an implicit motion model that improves track assignment; and (4) the detection-to-track match score improves the quality of the track confidence score. As a result, Minkowski Tracker achieves state-of-the-art performance on the nuScenes tracking task without hand-designed motion models.
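To make the final step concrete, the following is a minimal sketch of how a detection-to-track match probability matrix could drive track assignment and confidence updates. It is an illustration under stated assumptions, not the implementation used in Minkowski Tracker: the abstract does not specify the assignment rule or the score update, so the greedy matching, the `min_prob` threshold, and the product-based score in `update_track_scores` are all hypothetical choices, as are the function names.

```python
import numpy as np


def greedy_assign(match_prob, min_prob=0.5):
    """Pair detections with tracks in descending order of match probability.

    match_prob: array of shape (num_detections, num_tracks).
    min_prob:   hypothetical cutoff below which no pairing is made.
    Returns a list of (detection_index, track_index) pairs in which every
    detection and every track appears at most once.
    """
    match_prob = np.asarray(match_prob, dtype=float)
    pairs = []
    if match_prob.size == 0:
        return pairs
    used_dets, used_tracks = set(), set()
    # Visit all (detection, track) cells from most to least probable.
    for flat in np.argsort(-match_prob, axis=None):
        d, t = (int(i) for i in np.unravel_index(flat, match_prob.shape))
        if match_prob[d, t] < min_prob:
            break  # every remaining cell is even less probable
        if d in used_dets or t in used_tracks:
            continue
        pairs.append((d, t))
        used_dets.add(d)
        used_tracks.add(t)
    return pairs


def update_track_scores(det_scores, match_prob, pairs):
    """Assumed update rule: track score = detection score * match probability."""
    match_prob = np.asarray(match_prob, dtype=float)
    return {t: float(det_scores[d] * match_prob[d, t]) for d, t in pairs}


# Toy example: 3 detections, 2 existing tracks (values are made up).
match_prob = [[0.9, 0.2],
              [0.8, 0.1],
              [0.3, 0.7]]
det_scores = [0.8, 0.6, 0.5]

pairs = greedy_assign(match_prob)
track_scores = update_track_scores(det_scores, match_prob, pairs)
unmatched = sorted(set(range(len(det_scores))) - {d for d, _ in pairs})
```

In this toy example, detections 0 and 2 are assigned to tracks 0 and 1, and detection 1 is left unmatched; a tracker would typically start a new track from such a detection. A globally optimal assignment (for example, the Hungarian algorithm) could replace the greedy loop without changing the interface.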