Current geometry-based monocular 3D object detection models can efficiently detect objects by leveraging perspective geometry, but their performance is limited by the absence of accurate depth information. Although this issue can be alleviated in a depth-based model, where a depth estimation module is plugged in to predict depth before 3D box reasoning, introducing such a module dramatically reduces detection speed. Instead of training a costly depth estimator, we propose a rendering module that augments the training data by synthesizing images with virtual depths. The rendering module takes an RGB image and its corresponding sparse depth image as input and outputs a variety of photo-realistic synthetic images, from which the detection model can learn more discriminative features and adapt to depth changes of the objects. In addition, we introduce an auxiliary module that improves the detection model by jointly optimizing it through a depth estimation task. Both modules operate only at training time and introduce no extra computation into the detection model at inference. Experiments show that, equipped with the proposed modules, a geometry-based model achieves leading accuracy on the KITTI 3D detection benchmark.