3D object detection with surrounding cameras has been a promising direction for autonomous driving. In this paper, we present SimMOD, a Simple baseline for Multi-camera Object Detection, to tackle this problem. To incorporate multi-view information and build upon previous efforts in monocular 3D object detection, the framework is built on sample-wise object proposals and works in a two-stage manner. First, we extract multi-scale features and generate perspective object proposals on each monocular image. Second, the multi-view proposals are aggregated and then iteratively refined with multi-view, multi-scale visual features in the DETR3D style. The refined proposals are decoded end-to-end into detection results. To further boost performance, we incorporate auxiliary branches alongside proposal generation to enhance feature learning. We also design target filtering and teacher forcing strategies to promote the consistency of two-stage training. Extensive experiments on the nuScenes 3D object detection benchmark demonstrate the effectiveness of SimMOD, which achieves new state-of-the-art performance. Code will be available at https://github.com/zhangyp15/SimMOD.
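The sketch below illustrates the two-stage pipeline summarized above: per-camera proposal generation followed by iterative refinement of the aggregated proposals against multi-view features. It is a minimal conceptual example, not the authors' implementation; the module names (MonocularProposalHead, ProposalRefiner), tensor shapes, and output dimensions are assumptions, and a plain cross-attention decoder stands in for the DETR3D-style refinement, which in the actual method samples image features by projecting 3D reference points into each view.

```python
# Conceptual sketch of a two-stage multi-camera detection pipeline (hypothetical names/shapes).
import torch
import torch.nn as nn


class MonocularProposalHead(nn.Module):
    """Stage 1 (illustrative): per-camera proposal generation from image features."""

    def __init__(self, in_channels: int = 256, embed_dim: int = 256, num_proposals: int = 100):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)   # objectness per location
        self.embed = nn.Conv2d(in_channels, embed_dim, kernel_size=1)
        self.num_proposals = num_proposals

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) features from one camera at one scale
        scores = self.score(feat).flatten(2).squeeze(1)          # (B, H*W)
        embeds = self.embed(feat).flatten(2).transpose(1, 2)     # (B, H*W, D)
        top = scores.topk(self.num_proposals, dim=1).indices     # keep top-K locations
        idx = top.unsqueeze(-1).expand(-1, -1, embeds.size(-1))
        return torch.gather(embeds, 1, idx)                      # (B, K, D) proposal queries


class ProposalRefiner(nn.Module):
    """Stage 2 (illustrative): iterative refinement of aggregated multi-view proposals."""

    def __init__(self, embed_dim: int = 256, num_layers: int = 6, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerDecoderLayer(embed_dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.box_head = nn.Linear(embed_dim, 10)   # e.g. center, size, yaw, velocity (assumed)
        self.cls_head = nn.Linear(embed_dim, 10)   # e.g. 10 nuScenes classes

    def forward(self, proposals: torch.Tensor, image_tokens: torch.Tensor):
        # proposals: (B, N_cam*K, D) queries; image_tokens: (B, L, D) flattened multi-view features
        refined = self.decoder(proposals, image_tokens)
        return self.box_head(refined), self.cls_head(refined)


if __name__ == "__main__":
    num_cams = 6
    head, refiner = MonocularProposalHead(), ProposalRefiner()
    feats = [torch.randn(1, 256, 32, 88) for _ in range(num_cams)]      # one scale per camera
    proposals = torch.cat([head(f) for f in feats], dim=1)              # aggregate multi-view proposals
    tokens = torch.cat([f.flatten(2).transpose(1, 2) for f in feats], dim=1)
    boxes, logits = refiner(proposals, tokens)
    print(boxes.shape, logits.shape)                                    # (1, 600, 10) each
```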