In this work, we present a conceptually simple yet effective framework for cross-modality 3D object detection, named voxel field fusion. The proposed approach aims to maintain cross-modality consistency by representing and fusing augmented image features as a ray in the voxel field. To this end, a learnable sampler is first designed to sample vital features from the image plane, which are projected to the voxel grid in a point-to-ray manner; this maintains consistency in feature representation with spatial context. In addition, ray-wise fusion is conducted to fuse features with the supplemental context in the constructed voxel field. We further develop a mixed augmentor to align feature-variant transformations, which bridges the modality gap in data augmentation. The proposed framework achieves consistent gains on various benchmarks and outperforms previous fusion-based methods on the KITTI and nuScenes datasets. Code is available at https://github.com/dvlab-research/VFF.
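To make the point-to-ray projection concrete, the following is a minimal sketch, not the authors' implementation: it assumes a pinhole camera with intrinsics K, a voxel grid axis-aligned with the camera frame, uniform depth sampling along each ray, and a fixed fusion weight in place of the learned sampling score. The names `pixel_to_ray_voxels` and `fuse_ray_feature` are hypothetical.

```python
import numpy as np

def pixel_to_ray_voxels(u, v, K_inv, voxel_size, grid_shape,
                        depth_range, n_samples=64):
    """Back-project pixel (u, v) to a camera-frame ray and return the
    indices of the voxels that the ray traverses (uniform depth sampling)."""
    # Unproject the pixel to a unit-depth direction in camera coordinates.
    direction = K_inv @ np.array([u, v, 1.0])
    depths = np.linspace(depth_range[0], depth_range[1], n_samples)
    points = depths[:, None] * direction[None, :]          # (n_samples, 3)
    # Quantize the sampled 3D points to voxel indices.
    idx = np.floor(points / voxel_size).astype(int)
    # Keep only the points that fall inside the voxel grid.
    valid = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    return np.unique(idx[valid], axis=0)

def fuse_ray_feature(voxel_field, ray_voxels, image_feature, weight=1.0):
    """Scatter one sampled image feature into every voxel on its ray.
    The real method weights this by a learned score; here it is fixed."""
    for ix, iy, iz in ray_voxels:
        voxel_field[ix, iy, iz] += weight * image_feature
    return voxel_field

# Usage with hypothetical intrinsics and grid parameters.
C = 16                                    # feature channels
grid = np.zeros((16, 16, 64, C))          # (X, Y, Z, C) voxel field
K = np.array([[720., 0., 640.],
              [0., 720., 360.],
              [0.,   0.,   1.]])
voxels = pixel_to_ray_voxels(800, 400, np.linalg.inv(K), voxel_size=0.5,
                             grid_shape=(16, 16, 64), depth_range=(1.0, 30.0))
grid = fuse_ray_feature(grid, voxels, np.random.randn(C))
```

The sketch omits the camera-to-LiDAR calibration and the learnable sampling described in the abstract; in practice the sampled image features and ray weights come from the network rather than from fixed constants.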