Autonomous driving requires an accurate and fast 3D perception system that includes 3D object detection, tracking, and segmentation. Although recent low-cost camera-based approaches have shown promising results, they are susceptible to poor illumination or bad weather conditions and suffer from large localization error. Hence, fusing cameras with low-cost radar, which provides precise long-range measurement and operates reliably in all environments, is promising but has not yet been thoroughly investigated. In this paper, we propose Camera Radar Net (CRN), a novel camera-radar fusion framework that generates a semantically rich and spatially accurate bird's-eye-view (BEV) feature map for various tasks. To overcome the lack of spatial information in an image, we transform perspective-view image features to BEV with the help of sparse but accurate radar points. We further aggregate image and radar feature maps in BEV using multi-modal deformable attention designed to tackle the spatial misalignment between inputs. CRN with a real-time setting operates at 20 FPS while achieving performance comparable to LiDAR detectors on nuScenes, and even outperforms them at far distances in the 100 m setting. Moreover, CRN with an offline setting yields 62.4% NDS and 57.5% mAP on the nuScenes test set, ranking first among all camera and camera-radar 3D object detectors.
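The core fusion idea above, sampling features from both camera and radar BEV maps at learned offsets around each query and combining them with attention weights, can be illustrated with a minimal sketch. This is a hypothetical single-query, nearest-neighbor simplification for intuition only, not the paper's actual multi-head, bilinear-sampling implementation; all function and variable names here are invented:

```python
import numpy as np

def multimodal_deformable_attention(query_pos, cam_bev, rad_bev, offsets, weights):
    """Fuse camera and radar BEV features for one BEV query location.

    For each modality, features are sampled at the query position plus
    learned 2D offsets (nearest-neighbor here for simplicity), then
    combined with scalar attention weights. Sampling offsets per modality
    let the attention compensate for spatial misalignment between inputs.

    query_pos : (row, col) of the query in the BEV grid
    cam_bev, rad_bev : (H, W, C) BEV feature maps
    offsets : [cam_offsets, rad_offsets], each a list of (dy, dx) pairs
    weights : [cam_weights, rad_weights], matching scalar weights
    """
    H, W, _ = cam_bev.shape
    feats = []
    for bev, offs, ws in ((cam_bev, offsets[0], weights[0]),
                          (rad_bev, offsets[1], weights[1])):
        for (dy, dx), w in zip(offs, ws):
            # Clamp sampling locations to the BEV grid.
            y = int(np.clip(query_pos[0] + dy, 0, H - 1))
            x = int(np.clip(query_pos[1] + dx, 0, W - 1))
            feats.append(w * bev[y, x])
    return np.sum(feats, axis=0)  # (C,) fused feature for this query

# Toy usage: two sampling points per modality on small random BEV maps.
rng = np.random.default_rng(0)
cam = rng.standard_normal((8, 8, 4))
rad = rng.standard_normal((8, 8, 4))
offs = [[(0, 0), (1, 0)], [(0, 0), (0, 1)]]
wts = [[0.4, 0.2], [0.3, 0.1]]
fused = multimodal_deformable_attention((3, 3), cam, rad, offs, wts)
```

In the full model this runs for every BEV query in parallel, with offsets and weights predicted from the query features themselves, which is what makes the attention "deformable."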