Delving into the Pre-training Paradigm of Monocular 3D Object Detection

Labels for monocular 3D object detection (M3OD) are expensive to obtain. Meanwhile, practical applications usually offer large amounts of unlabeled data, and pre-training is an efficient way of exploiting the knowledge in such data. However, the pre-training paradigm for M3OD has hardly been studied. We aim to bridge this gap in this work. To this end, we first draw two observations: (1) The guideline for devising pre-training tasks is to imitate the representation of the target task. (2) Combining depth estimation and 2D object detection is a promising M3OD pre-training baseline. Afterwards, following the guideline, we propose several strategies to further improve this baseline, which mainly include target-guided semi-dense depth estimation, keypoint-aware 2D object detection, and class-level loss adjustment. Combining all the developed techniques, the resulting pre-training framework produces pre-trained backbones that significantly improve M3OD performance on both the KITTI-3D and nuScenes benchmarks. For example, by applying a DLA34 backbone to a naive center-based M3OD detector, the moderate ${\rm AP}_{3D}70$ score of Car on the KITTI-3D testing set is boosted by 18.71\% and the NDS score on the nuScenes validation set is improved by 40.41\% in relative terms.
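
To make the baseline in observation (2) concrete, the following is a minimal sketch of how a depth-estimation term and a 2D-detection term might be combined into one pre-training objective. The abstract does not give the actual loss formulation, so everything here is an assumption for illustration: the function names, the masked L1 depth loss (standing in for "semi-dense" supervision on a subset of pixels), the per-class weights (standing in for "class-level loss adjustment"), and the two balancing weights are all hypothetical.

```python
def masked_l1_depth_loss(pred, target, mask):
    """Mean absolute depth error over the pixels where mask is truthy.

    Assumption: semi-dense supervision is modelled as a binary mask that
    selects which pixels carry a depth target.
    """
    errors = [abs(p - t) for p, t, m in zip(pred, target, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0


def weighted_class_loss(per_class_losses, class_weights):
    """Weighted sum of per-class detection losses.

    Assumption: class-level adjustment is modelled as a scalar weight per
    class; classes without an explicit weight default to 1.0.
    """
    return sum(
        class_weights.get(name, 1.0) * loss
        for name, loss in per_class_losses.items()
    )


def pretraining_loss(pred_depth, target_depth, depth_mask,
                     det_losses, class_weights,
                     depth_weight=1.0, det_weight=1.0):
    """Hypothetical joint objective: weighted depth term plus detection term."""
    depth_term = masked_l1_depth_loss(pred_depth, target_depth, depth_mask)
    det_term = weighted_class_loss(det_losses, class_weights)
    return depth_weight * depth_term + det_weight * det_term


# Toy example with three pixels, of which only the first and third are supervised.
total = pretraining_loss(
    pred_depth=[10.0, 20.0, 30.0],
    target_depth=[12.0, 0.0, 27.0],
    depth_mask=[1, 0, 1],
    det_losses={"Car": 1.0, "Pedestrian": 2.0},
    class_weights={"Pedestrian": 0.5},
)
```

In a real pipeline these terms would be computed on tensors from the backbone's prediction heads; the plain-list version only shows how the pieces could fit together.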
