Monocular 3D object detection is an economical but challenging task in autonomous driving. Recently, center-based monocular methods have developed rapidly, offering a good trade-off between speed and accuracy; they usually depend on estimating the depth of the object center from 2D features. However, visual semantic features that lack sufficient pixel-level geometry information may provide weak clues for the spatial 3D detection task. To alleviate this, we propose MonoPGC, a novel end-to-end monocular 3D object detection framework with rich pixel geometry contexts. We introduce pixel depth estimation as an auxiliary task and design a depth cross-attention pyramid module (DCPM) to inject local and global depth geometry knowledge into the visual features. In addition, we present a depth-space-aware transformer (DSAT) to efficiently integrate 3D spatial positions and depth-aware features. Finally, we design a novel depth-gradient positional encoding (DGPE) that brings more distinct pixel geometry contexts into the transformer for better object detection. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the KITTI dataset.
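The abstract does not specify the internals of the DCPM, but its core idea is cross-attention in which visual features act as queries over depth features. As a rough, illustrative sketch only (all weight matrices, shapes, and the residual fusion are assumptions, not the paper's actual design), single-head depth cross-attention might look like this:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def depth_cross_attention(visual, depth, d_k=16, seed=0):
    """Hypothetical single-head cross-attention: visual tokens (queries)
    attend to depth tokens (keys/values), injecting depth geometry into
    the visual features via a residual connection.

    visual: (N, C) flattened visual feature tokens
    depth:  (M, C) flattened depth feature tokens
    """
    rng = np.random.default_rng(seed)  # random weights stand in for learned ones
    _, C = visual.shape
    Wq = rng.standard_normal((C, d_k)) / np.sqrt(C)
    Wk = rng.standard_normal((C, d_k)) / np.sqrt(C)
    Wv = rng.standard_normal((C, C)) / np.sqrt(C)

    q = visual @ Wq            # (N, d_k)
    k = depth @ Wk             # (M, d_k)
    v = depth @ Wv             # (M, C)
    attn = softmax(q @ k.T / np.sqrt(d_k))  # (N, M), rows sum to 1
    return visual + attn @ v   # residual fusion of depth context
```

In the actual module, the projections would be learned and the attention applied at multiple pyramid scales; this sketch only shows the attention mechanics.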