Recent works have revealed the superiority of feature-level fusion for cross-modal 3D object detection, where fine-grained feature propagation from 2D image pixels to 3D LiDAR points has been widely adopted for performance improvement. Still, the potential of heterogeneous feature propagation between the 2D and 3D domains has not been fully explored. In this paper, in contrast to existing pixel-to-point feature propagation, we investigate the opposite point-to-pixel direction, allowing point-wise features to flow inversely into the 2D image branch. Thus, when jointly optimizing the 2D and 3D streams, the gradients back-propagated from the 2D image branch can boost the representation ability of the 3D backbone network operating on LiDAR point clouds. Combining the pixel-to-point and point-to-pixel information flow mechanisms, we construct a bidirectional feature propagation framework, dubbed BiProDet. In addition to the architectural design, we also propose normalized local coordinates (NLC) map estimation, a new 2D auxiliary task for training the 2D image branch, which facilitates learning local spatial-aware features from the image modality and implicitly enhances overall 3D detection performance. Extensive experiments and ablation studies validate the effectiveness of our method. Notably, we rank $\mathbf{1^{\mathrm{st}}}$ on the highly competitive KITTI benchmark for the cyclist class at the time of submission. The source code is available at https://github.com/Eaphan/BiProDet.
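The point-to-pixel direction described above can be illustrated with a minimal sketch (plain Python; all function names and camera parameters are hypothetical, not the authors' implementation): point-wise features are projected into the image plane via a pinhole camera model and scattered into a 2D feature grid that the image branch consumes, so gradients from the image branch can flow back to the point features during joint training.

```python
# Minimal sketch of point-to-pixel feature propagation (illustrative only).
# Each 3D point carries a feature vector; we project points into the image
# plane and scatter their features into an H x W x C grid, averaging when
# multiple points land on the same pixel.

def project_to_pixel(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (x, y, z) to integer pixel coords."""
    x, y, z = point
    u = int(fx * x / z + cx)
    v = int(fy * y / z + cy)
    return u, v

def point_to_pixel_scatter(points, feats, h, w,
                           fx=100.0, fy=100.0, cx=0.0, cy=0.0):
    """Scatter point features into an H x W x C grid, averaging collisions."""
    c = len(feats[0])
    grid = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    count = [[0 for _ in range(w)] for _ in range(h)]
    for p, f in zip(points, feats):
        if p[2] <= 0:  # skip points behind the camera
            continue
        u, v = project_to_pixel(p, fx, fy, cx, cy)
        if 0 <= v < h and 0 <= u < w:
            count[v][u] += 1
            for k in range(c):
                grid[v][u][k] += f[k]
    # Average pixels hit by more than one point.
    for v in range(h):
        for u in range(w):
            if count[v][u] > 1:
                for k in range(c):
                    grid[v][u][k] /= count[v][u]
    return grid
```

In a real detector this scatter would be differentiable (e.g. implemented with tensor index operations), which is what lets image-branch gradients reach the 3D backbone.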