BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View

Autonomous driving must perceive the surrounding environment in order to make decisions, and it is one of the most complicated scenarios for visual perception. The success of paradigm innovation in 2D object detection inspires us to seek an elegant, feasible, and scalable paradigm that pushes the performance boundary in this area. To this end, we contribute the BEVDet paradigm in this paper. BEVDet follows the principle of detecting 3D objects in Bird-Eye-View (BEV), where route planning can be performed conveniently. In this paradigm, four kinds of modules run in succession, each with a different role: an image-view encoder that encodes features in image view, a view transformer that transforms features from image view to BEV, a BEV encoder that further encodes features in BEV, and a task-specific head that predicts the targets in BEV. We merely reuse existing modules to construct BEVDet, and we make it feasible for multi-camera 3D object detection by designing a dedicated data augmentation strategy. The proposed paradigm works well in multi-camera 3D object detection and offers a good trade-off between computing budget and performance. With a 704x256 image size (1/8 of that used by competitors), BEVDet scores 29.4% mAP and 38.4% NDS on the nuScenes val set. This is comparable with FCOS3D (2008.2 GFLOPs, 1.7 FPS, 29.5% mAP, and 37.2% NDS), while BEVDet requires only 239.4 GFLOPs, about 12% of the computing budget, and runs 4.3 times faster. When the input size is scaled up to 1408x512, BEVDet scores 34.9% mAP and 41.7% NDS, which requires just 601.4 GFLOPs and surpasses FCOS3D by 5.4% mAP and 4.5% NDS. The superior performance of BEVDet demonstrates the power of paradigm innovation.
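To make the data flow of the four-module paradigm concrete, the following is a minimal Python sketch that chains the four roles named in the abstract. It is not the BEVDet implementation: the class name `FourStageDetector`, the `Grid` type, and all `toy_*` functions are hypothetical stand-ins introduced for illustration, and the toy stages are simple arithmetic on nested lists rather than neural networks or a geometric view transformation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Grid = List[List[float]]  # toy 2D feature map; real features would be tensors
Detection = Tuple[int, int, float]  # (row, col, score) in the toy BEV grid


@dataclass
class FourStageDetector:
    """Hypothetical container for the four module roles named in the abstract."""
    image_view_encoder: Callable[[Grid], Grid]
    view_transformer: Callable[[Sequence[Grid]], Grid]
    bev_encoder: Callable[[Grid], Grid]
    head: Callable[[Grid], List[Detection]]

    def __call__(self, camera_images: Sequence[Grid]) -> List[Detection]:
        # 1. Encode each camera image independently in image view.
        image_feats = [self.image_view_encoder(img) for img in camera_images]
        # 2. Fuse the per-camera features into a single BEV feature.
        bev_feat = self.view_transformer(image_feats)
        # 3. Further encode the fused feature in BEV.
        bev_feat = self.bev_encoder(bev_feat)
        # 4. Predict the targets from the BEV feature.
        return self.head(bev_feat)


# Toy stand-ins (assumptions): they only make the data flow observable.
def toy_image_encoder(img: Grid) -> Grid:
    return [[2.0 * v for v in row] for row in img]


def toy_view_transformer(feats: Sequence[Grid]) -> Grid:
    # Element-wise sum across cameras; a real view transformer
    # would use camera geometry to map image features into BEV.
    rows, cols = len(feats[0]), len(feats[0][0])
    return [[sum(f[r][c] for f in feats) for c in range(cols)]
            for r in range(rows)]


def toy_bev_encoder(bev: Grid) -> Grid:
    return [[v + 1.0 for v in row] for row in bev]


def toy_head(bev: Grid) -> List[Detection]:
    # Report the single highest-scoring BEV cell as one detection.
    score, r, c = max((v, r, c)
                      for r, row in enumerate(bev)
                      for c, v in enumerate(row))
    return [(r, c, score)]


detector = FourStageDetector(toy_image_encoder, toy_view_transformer,
                             toy_bev_encoder, toy_head)
images = [[[0.0, 1.0], [0.0, 0.0]],   # toy image from camera 1
          [[0.0, 2.0], [1.0, 0.0]]]   # toy image from camera 2
detections = detector(images)
print(detections)  # [(0, 1, 7.0)]
```

Passing the stages in as callables mirrors the abstract's point that the paradigm reuses existing modules: in this sketch, any stage can be swapped without touching the other three.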
