Labels for monocular 3D object detection (M3OD) are expensive to obtain. Meanwhile, practical applications usually offer abundant unlabeled data, and pre-training is an efficient way to exploit the knowledge in such data. However, the pre-training paradigm for M3OD has hardly been studied, and this work aims to bridge that gap. To this end, we first draw two observations: (1) the guiding principle for devising pre-training tasks is to imitate the representation of the target task; (2) combining depth estimation and 2D object detection is a promising M3OD pre-training baseline. Following this principle, we then propose several strategies to further improve the baseline, mainly target-guided semi-dense depth estimation, keypoint-aware 2D object detection, and class-level loss adjustment. Combining all the developed techniques, the resulting pre-training framework produces pre-trained backbones that significantly improve M3OD performance on both the KITTI-3D and nuScenes benchmarks. For example, when a DLA34 backbone is applied to a naive center-based M3OD detector, the moderate ${\rm AP}_{3D}70$ score of Car on the KITTI-3D testing set is boosted by 18.71\% and the NDS score on the nuScenes validation set is improved by 40.41\% relatively.