Object detection from either RGB images or LiDAR point clouds has been extensively explored in autonomous driving. However, it remains challenging to make these two data sources complementary and mutually beneficial. In this paper, we propose \textit{AutoAlign}, an automatic feature fusion strategy for 3D object detection. Instead of establishing deterministic correspondences with the camera projection matrix, we model the mapping relationship between images and point clouds with a learnable alignment map. This map enables our model to automate the alignment of non-homogeneous features in a dynamic, data-driven manner. Specifically, a cross-attention feature alignment module is devised to adaptively aggregate \textit{pixel-level} image features for each voxel. To enhance semantic consistency during feature alignment, we also design a self-supervised cross-modal feature interaction module, through which the model learns feature aggregation under \textit{instance-level} feature guidance. Extensive experimental results show that our approach yields improvements of 2.3 mAP and 7.0 mAP on the KITTI and nuScenes datasets, respectively. Notably, our best model reaches 70.9 NDS on the nuScenes testing leaderboard, achieving competitive performance among various state-of-the-art methods.
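To make the cross-attention feature alignment idea concrete, the following PyTorch-style sketch (our own illustration, not the authors' released code; all module and dimension names such as \texttt{d\_voxel} and \texttt{d\_img} are assumptions) shows how non-empty voxel features could act as queries over flattened pixel-level image features, so that the softmax attention weights play the role of a learnable alignment map in place of a fixed camera-projection lookup.

```python
# Minimal sketch of cross-attention feature alignment, assuming voxel features
# of shape (N, d_voxel) and flattened image features of shape (P, d_img).
import torch
import torch.nn as nn

class CrossAttentionFeatureAlignment(nn.Module):
    def __init__(self, d_voxel: int, d_img: int, d_model: int = 128):
        super().__init__()
        self.q_proj = nn.Linear(d_voxel, d_model)    # voxel features -> queries
        self.k_proj = nn.Linear(d_img, d_model)      # pixel features -> keys
        self.v_proj = nn.Linear(d_img, d_model)      # pixel features -> values
        self.out_proj = nn.Linear(d_model, d_voxel)  # map back to voxel space
        self.scale = d_model ** -0.5

    def forward(self, voxel_feats: torch.Tensor, img_feats: torch.Tensor):
        q = self.q_proj(voxel_feats)                          # (N, d_model)
        k = self.k_proj(img_feats)                            # (P, d_model)
        v = self.v_proj(img_feats)                            # (P, d_model)
        # (N, P) attention weights: a data-driven alignment map from
        # each voxel to every pixel, learned rather than fixed by geometry.
        attn = torch.softmax(q @ k.t() * self.scale, dim=-1)
        aggregated = attn @ v                                 # (N, d_model)
        # Residual fusion: enrich each voxel with aligned image context.
        return voxel_feats + self.out_proj(aggregated)
```

In this reading, the attention matrix is the alignment map the abstract refers to: because it is produced by learned projections rather than a hard projection matrix, the aggregation of image features for each voxel adapts to the data during training.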