Fusing data from cameras and LiDAR sensors is an essential technique for robust 3D object detection. One key challenge in camera-LiDAR fusion is mitigating the large domain gap between the two sensors, in terms of both coordinates and data distribution, when fusing their features. In this paper, we propose a novel camera-LiDAR fusion architecture, called 3D Dual-Fusion, which is designed to mitigate the gap between the feature representations of camera and LiDAR data. The proposed method fuses the features of the camera-view and 3D voxel-view domains and models their interactions through deformable attention. We redesign the transformer fusion encoder to aggregate information from the two domains. The two major changes are 1) dual query-based deformable attention, which fuses the dual-domain features interactively, and 2) 3D local self-attention, which encodes the voxel-domain queries prior to dual-query decoding. The results of an experimental evaluation show that the proposed camera-LiDAR fusion architecture achieved competitive performance on the KITTI and nuScenes datasets, with state-of-the-art results in some categories of the 3D object detection benchmarks.
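To make the dual-query idea concrete, the following is a minimal, hypothetical PyTorch sketch of single-head deformable cross-attention in which queries from one domain sample features from the other domain's feature map. All names, shapes, and hyperparameters (e.g., `DualQueryDeformableAttention`, `num_points`) are illustrative assumptions, not the paper's implementation, which operates on 3D voxel features with multi-head attention.

```python
# Hypothetical sketch: NOT the paper's code. Single-head deformable
# cross-attention over simplified 2D feature maps for both domains.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualQueryDeformableAttention(nn.Module):
    """Each query predicts K sampling offsets into the other domain's
    feature map and aggregates the bilinearly sampled features."""

    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        # Offsets and attention weights are predicted from the queries.
        self.offset = nn.Linear(dim, 2 * num_points)
        self.weight = nn.Linear(dim, num_points)
        self.proj = nn.Linear(dim, dim)

    def forward(self, query, ref_xy, feat):
        # query:  (B, N, C)    queries from one domain
        # ref_xy: (B, N, 2)    reference points in [-1, 1] (grid_sample coords)
        # feat:   (B, C, H, W) feature map of the other domain
        B, N, C = query.shape
        K = self.num_points
        offsets = self.offset(query).view(B, N, K, 2).tanh() * 0.1  # small shifts
        weights = self.weight(query).softmax(dim=-1)                # (B, N, K)
        grid = (ref_xy[:, :, None, :] + offsets).clamp(-1, 1)       # (B, N, K, 2)
        sampled = F.grid_sample(feat, grid, align_corners=False)    # (B, C, N, K)
        out = (sampled * weights[:, None, :, :]).sum(-1)            # (B, C, N)
        return self.proj(out.transpose(1, 2))                       # (B, N, C)

# Usage sketch: queries from each domain attend into the other domain's
# features and are updated residually (in practice each direction would
# use its own module and run inside a transformer encoder layer).
B, C, N = 2, 64, 100
cam_q = torch.randn(B, N, C)            # camera-view queries
vox_q = torch.randn(B, N, C)            # voxel-view queries (e.g., BEV-flattened)
cam_feat = torch.randn(B, C, 48, 160)   # camera feature map
vox_feat = torch.randn(B, C, 128, 128)  # voxel/BEV feature map
ref = torch.rand(B, N, 2) * 2 - 1       # shared reference points in [-1, 1]

attn = DualQueryDeformableAttention(C)
cam_q = cam_q + attn(cam_q, ref, vox_feat)  # camera queries gather voxel features
vox_q = vox_q + attn(vox_q, ref, cam_feat)  # voxel queries gather camera features
```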