Multi-modal 3D object detection is an active research topic in autonomous driving. Nevertheless, cross-modal feature fusion between sparse 3D points and dense 2D pixels is non-trivial. Recent approaches either fuse image features with point cloud features that are projected onto the 2D image plane, or combine the sparse point cloud with dense image pixels. Both strategies often suffer from severe information loss, which leads to sub-optimal performance. To address these problems, we construct a homogeneous structure between the point cloud and the images by transforming the camera features into the LiDAR 3D space, which avoids projective information loss. In this paper, we propose a homogeneous multi-modal feature fusion and interaction method (HMFI) for 3D object detection. Specifically, we first design an image voxel lifter module (IVLM) that lifts 2D image features into 3D space and generates homogeneous image voxel features. Then, we fuse the voxelized point cloud features with the image features from different regions through a self-attention-based query fusion mechanism (QFM). Next, we propose a voxel feature interaction module (VFIM) that enforces the consistency of semantic information from identical objects in the homogeneous point cloud and image voxel representations; this provides object-level alignment guidance for cross-modal feature fusion and strengthens the discriminative ability in complex backgrounds. We conduct extensive experiments on KITTI and the Waymo Open Dataset, where the proposed HMFI achieves better performance than state-of-the-art multi-modal methods. In particular, for 3D detection of cyclists on the KITTI benchmark, HMFI surpasses all published algorithms by a large margin.
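To make the idea of placing image features in the LiDAR voxel space concrete, here is a minimal NumPy sketch. It is not the IVLM described above, whose internals the abstract does not specify. It swaps in a generic technique: each voxel centre is projected into the image with a pinhole camera model and takes the feature at the nearest pixel. The function name `lift_image_features_to_voxels`, its parameters, the intrinsic matrix `K` and the LiDAR-to-camera transform `T_cam_from_lidar` are all assumptions made for illustration.

```python
import numpy as np

def lift_image_features_to_voxels(feat, K, T_cam_from_lidar,
                                  grid_min, voxel_size, grid_shape):
    """Illustrative only (hypothetical helper, not the paper's IVLM).

    feat:              (H, W, C) image feature map
    K:                 (3, 3) pinhole intrinsics (assumed known)
    T_cam_from_lidar:  (4, 4) rigid transform, LiDAR frame -> camera frame
    grid_min:          (3,) minimum corner of the voxel grid in the LiDAR frame
    voxel_size:        scalar or (3,) voxel edge length(s)
    grid_shape:        (X, Y, Z) number of voxels per axis
    Returns (X, Y, Z, C) voxel features and an (X, Y, Z) validity mask.
    """
    H, W, C = feat.shape
    grid_min = np.asarray(grid_min, dtype=float)

    # Voxel-centre coordinates in the LiDAR frame, shape (X, Y, Z, 3).
    idx = np.stack(
        np.meshgrid(*[np.arange(n) for n in grid_shape], indexing="ij"),
        axis=-1,
    )
    centers = grid_min + (idx + 0.5) * voxel_size
    pts = centers.reshape(-1, 3)

    # Transform the centres into the camera frame using homogeneous coordinates.
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
    cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    # Perspective projection; points at or behind the camera are invalid.
    depth = cam[:, 2]
    in_front = depth > 1e-6
    safe_depth = np.where(in_front, depth, 1.0)  # avoid division by zero
    uvw = (K @ cam.T).T
    u = np.floor(uvw[:, 0] / safe_depth).astype(int)
    v = np.floor(uvw[:, 1] / safe_depth).astype(int)

    # Keep only projections that land inside the feature map.
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    # Nearest-pixel sampling; voxels without a valid projection stay zero.
    out = np.zeros((len(pts), C), dtype=feat.dtype)
    out[valid] = feat[v[valid], u[valid]]
    return out.reshape(*grid_shape, C), valid.reshape(grid_shape)
```

Because the result lives on the same grid as a voxelized point cloud, the two modalities can be compared or fused cell by cell. A real system would also need depth reasoning, since every voxel along a camera ray receives the same pixel feature under this simple scheme.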