Building 3D perception systems for autonomous vehicles that do not rely on high-density LiDAR is a critical research problem, because LiDAR systems are far more expensive than cameras and other sensors. Recent research has developed a variety of camera-only methods, where features are differentiably "lifted" from the multi-camera images onto the 2D ground plane, yielding a "bird's eye view" (BEV) feature representation of the 3D space around the vehicle. This line of work has produced a variety of novel "lifting" methods, but we observe that other details in the training setups have shifted at the same time, making it unclear what really matters in top-performing methods. We also observe that using cameras alone is not a real-world constraint, considering that additional sensors like radar have been integrated into real vehicles for years already. In this paper, we first attempt to elucidate the high-impact factors in the design and training protocol of BEV perception models. We find that batch size and input resolution greatly affect performance, while lifting strategies have a more modest effect -- even a simple parameter-free lifter works well. Second, we demonstrate that radar data can provide a substantial boost to performance, helping to close the gap between camera-only and LiDAR-enabled systems. We analyze the radar usage details that lead to good performance, and invite the community to reconsider this commonly neglected part of the sensor platform.
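The "parameter-free lifter" mentioned above can be illustrated with a short sketch: project 3D sample points (e.g. BEV voxel centers at several heights) into a camera's image plane and bilinearly sample the image feature map at the projected pixels, with no learned lifting parameters. This is a minimal NumPy sketch under standard pinhole-camera assumptions; the function name, shapes, and interface are illustrative, not the paper's actual code.

```python
import numpy as np

def simple_lift(feat_map, K, cam_T_world, points_world):
    """Parameter-free lifting: project 3D world points into one camera
    and bilinearly sample its image feature map at the projected pixels.

    feat_map:      (H, W, C) image feature map
    K:             (3, 3) camera intrinsics
    cam_T_world:   (4, 4) world-to-camera extrinsic transform
    points_world:  (N, 3) 3D sample points (e.g. voxel centers)
    returns:       (N, C) sampled features; zeros where a point is
                   behind the camera or projects outside the image.
    """
    H, W, C = feat_map.shape
    N = points_world.shape[0]

    # World -> camera coordinates (homogeneous), then perspective divide.
    pts_h = np.concatenate([points_world, np.ones((N, 1))], axis=1)
    cam = (cam_T_world @ pts_h.T).T[:, :3]
    z = cam[:, 2]
    uvw = (K @ cam.T).T
    u = uvw[:, 0] / np.clip(z, 1e-6, None)
    v = uvw[:, 1] / np.clip(z, 1e-6, None)

    # Validity mask: in front of the camera and inside the image bounds.
    valid = (z > 0) & (u >= 0) & (u <= W - 1) & (v >= 0) & (v <= H - 1)

    # Bilinear sampling at (u, v).
    u0 = np.floor(np.clip(u, 0, W - 2)).astype(int)
    v0 = np.floor(np.clip(v, 0, H - 2)).astype(int)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    f00 = feat_map[v0, u0]
    f01 = feat_map[v0, u0 + 1]
    f10 = feat_map[v0 + 1, u0]
    f11 = feat_map[v0 + 1, u0 + 1]
    sampled = (f00 * (1 - du) * (1 - dv) + f01 * du * (1 - dv)
               + f10 * (1 - du) * dv + f11 * du * dv)

    out = np.zeros((N, C))
    out[valid] = sampled[valid]
    return out
```

In a multi-camera rig, the same sampling would be repeated per camera and the features averaged over the cameras (and over the height samples of each BEV cell) to form the BEV grid; every step is differentiable, so image-feature gradients flow through the sampling.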