In this paper, we propose M$^2$BEV, a unified framework that jointly performs 3D object detection and map segmentation in the Bird's Eye View~(BEV) space with multi-camera image inputs. Unlike the majority of previous works, which process detection and segmentation separately, M$^2$BEV infers both tasks with a single unified model and thereby improves efficiency. M$^2$BEV efficiently transforms multi-view 2D image features into a 3D BEV feature in the ego-car coordinate frame. This BEV representation is important because it enables the different tasks to share a single encoder. Our framework further contains four important designs that benefit both accuracy and efficiency: (1) an efficient BEV encoder design that reduces the spatial dimension of the voxel feature map; (2) a dynamic box assignment strategy that uses learning-to-match to assign ground-truth 3D boxes to anchors; (3) a BEV centerness re-weighting that assigns larger weights to more distant predictions; and (4) large-scale 2D detection pre-training and auxiliary supervision. We show that these designs significantly benefit ill-posed camera-based 3D perception tasks, where depth information is missing. M$^2$BEV is memory-efficient, allowing significantly higher-resolution images as input with faster inference speed. Experiments on nuScenes show that M$^2$BEV achieves state-of-the-art results in both 3D object detection and BEV segmentation, with the best single model achieving 42.5 mAP and 57.0 mIoU on these two tasks, respectively.
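To make the 2D-to-3D transformation concrete, the following is a minimal sketch of how multi-view image features can be unprojected into an ego-centric BEV voxel grid: each voxel center is projected into every camera using its intrinsics and extrinsics, and the 2D feature at the projected pixel is sampled, so all voxels along a camera ray share the same image feature. All tensor names, shapes, and the averaging over cameras are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def unproject_to_bev(img_feats, intrinsics, extrinsics, voxel_centers):
    """Fill a BEV voxel grid with multi-view 2D image features.

    img_feats:     (N_cam, C, H, W)  per-camera 2D feature maps
    intrinsics:    (N_cam, 3, 3)     camera intrinsic matrices
    extrinsics:    (N_cam, 4, 4)     ego-to-camera rigid transforms
    voxel_centers: (X, Y, Z, 3)      voxel center coordinates in the ego frame

    Returns a (C, X, Y, Z) voxel feature volume, averaged over the cameras
    that see each voxel. Illustrative sketch only, not the paper's code.
    """
    n_cam, c, h, w = img_feats.shape
    X, Y, Z, _ = voxel_centers.shape
    pts = voxel_centers.reshape(-1, 3)                        # (P, 3) in ego frame
    pts_h = torch.cat([pts, torch.ones_like(pts[:, :1])], 1)  # homogeneous (P, 4)

    vol = torch.zeros(c, pts.shape[0])
    hits = torch.zeros(1, pts.shape[0])
    for i in range(n_cam):
        cam_pts = (extrinsics[i] @ pts_h.T)[:3]               # (3, P) in camera frame
        depth = cam_pts[2].clamp(min=1e-5)
        uv = (intrinsics[i] @ cam_pts)[:2] / depth            # (2, P) pixel coordinates
        # Normalize to [-1, 1] for grid_sample; mask points behind the camera
        # or projecting outside the image.
        grid = torch.stack([uv[0] / (w - 1) * 2 - 1,
                            uv[1] / (h - 1) * 2 - 1], dim=-1)  # (P, 2)
        valid = (cam_pts[2] > 0) & (grid.abs() <= 1).all(dim=-1)
        sampled = F.grid_sample(img_feats[i:i + 1], grid.view(1, 1, -1, 2),
                                align_corners=True).view(c, -1)
        vol += sampled * valid        # every voxel on a ray gets the same 2D feature
        hits += valid
    vol = vol / hits.clamp(min=1)     # average over cameras that observe the voxel
    return vol.view(c, X, Y, Z)
```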
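Design (3) re-weights the training loss according to distance from the ego vehicle. The helper below sketches one plausible form, in which the weight grows from 1 toward 2 with the normalized BEV distance; the exact formula, the function name `bev_centerness_weight`, and the default grid range are assumptions for illustration only.

```python
import torch

def bev_centerness_weight(x, y, x_max=51.2, y_max=51.2):
    """Per-location loss weight that grows with distance from the ego car.

    x, y:         BEV coordinates (tensors) of targets, in meters.
    x_max, y_max: half-extent of the BEV grid (assumed values for illustration).

    Returns weights in [1, 2]; farther locations receive larger weights,
    compensating for distant objects that cover fewer image pixels and are
    harder to localize. The exact functional form is an assumption.
    """
    return 1.0 + torch.sqrt((x ** 2 + y ** 2) / (x_max ** 2 + y_max ** 2))
```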