In this research, we propose a new 3D object detector with trustworthy depth estimation, dubbed BEVDepth, for camera-based Bird's-Eye-View (BEV) 3D object detection. Through a thorough analysis of recent approaches, we discover that the depth estimation is implicitly learned without camera information, making it the de-facto fake depth for creating the subsequent pseudo point cloud. BEVDepth obtains explicit depth supervision utilizing encoded intrinsic and extrinsic parameters. A depth correction sub-network is further introduced to counteract projection-induced disturbances in the depth ground truth. To reduce the speed bottleneck when projecting features from image view into BEV using the estimated depth, a fast view-transform operation is also proposed. Besides, our BEVDepth can be easily extended with multi-frame input. Without any bells and whistles, BEVDepth achieves a new state-of-the-art 60.0% NDS on the challenging nuScenes test set while maintaining high efficiency. For the first time, the performance gap between camera and LiDAR is reduced to within 10% NDS.
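To make the view-transform step concrete, below is a minimal NumPy sketch of the Lift-Splat-style mechanism the abstract refers to: image features are weighted by a per-pixel depth distribution to form pseudo point-cloud ("frustum") features, which are then back-projected with camera parameters and scatter-added into a BEV grid. All shapes, the depth-bin layout, and the toy pinhole camera are illustrative assumptions, not the paper's actual implementation or its optimized pooling kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, D = 4, 6, 8, 10                  # image size, feature channels, depth bins (toy values)
depth_bins = np.linspace(1.0, 10.0, D)    # metric depth of each bin (assumed layout)

feat = rng.normal(size=(C, H, W))         # image-view features
depth_prob = np.exp(rng.normal(size=(D, H, W)))
depth_prob /= depth_prob.sum(axis=0, keepdims=True)   # softmax over depth bins

# "Lift": outer product of features and the depth distribution yields a
# frustum of pseudo point-cloud features, shape (D, C, H, W).
frustum = depth_prob[:, None] * feat[None]

# Back-project each pixel ray using a toy pinhole intrinsic
# (identity extrinsic assumed for brevity).
fx = fy = 2.0
cx, cy = W / 2, H / 2
us, vs = np.meshgrid(np.arange(W), np.arange(H))           # pixel coords, (H, W)
x = (us[None] - cx) / fx * depth_bins[:, None, None]       # lateral coord, (D, H, W)
z = np.broadcast_to(depth_bins[:, None, None], (D, H, W))  # forward coord

# "Splat": scatter-add frustum features into a coarse BEV grid over (x, z).
G, cell = 16, 2.0
ix = np.clip(((x + 16.0) / cell).astype(int), 0, G - 1)
iz = np.clip((z / cell).astype(int), 0, G - 1)
bev = np.zeros((C, G, G))
flat_idx = (ix * G + iz).ravel()          # flattened BEV cell index per frustum point
np.add.at(bev.reshape(C, -1).T, flat_idx, frustum.transpose(0, 2, 3, 1).reshape(-1, C))

print(bev.shape)  # (8, 16, 16)
```

Because the clipping keeps every frustum point inside the grid, the scatter-add conserves the total feature mass; BEVDepth's contribution is to supervise `depth_prob` explicitly (from projected LiDAR depth) rather than letting it be learned implicitly.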