Efficient point cloud representation is a fundamental element of LiDAR-based 3D object detection. Recent grid-based detectors usually divide point clouds into voxels or pillars and construct single-stream networks in the bird's eye view (BEV). However, these point cloud encoding paradigms underexploit the point representation along the vertical direction, which causes the loss of semantic or fine-grained information, especially for vertically sensitive objects such as pedestrians and cyclists. In this paper, we propose an explicit vertical multi-scale representation learning framework, VPFusion, that combines the complementary information from both voxel and pillar streams. Specifically, VPFusion first builds upon a sparse voxel-pillar backbone, which partitions the point cloud into voxels and pillars and encodes their features with 3D and 2D sparse convolutions simultaneously. Next, we introduce the Sparse Fusion Layer (SFL), which establishes a bidirectional pathway between sparse voxel and pillar features, enabling interaction between the two streams. Additionally, we present the Dense Fusion Neck (DFN) to effectively combine the multi-scale dense feature maps from the voxel and pillar branches. Extensive experiments on the large-scale Waymo Open Dataset and nuScenes Dataset demonstrate that VPFusion surpasses the single-stream baselines by a large margin and achieves state-of-the-art performance with real-time inference speed.
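To make the bidirectional voxel-pillar exchange concrete, the following is a minimal sketch of the idea behind the SFL, not the paper's implementation: dense tensors stand in for the sparse voxel/pillar features, ordinary 1x1 convolutions stand in for the sparse convolutions, and the module and variable names (`VoxelPillarFusion`, `voxel_to_pillar`, `pillar_to_voxel`) are hypothetical. The sketch assumes the voxel stream collapses its vertical axis by max-pooling before joining the pillar stream, and the pillar stream is broadcast along the vertical axis before joining the voxel stream.

```python
# Hypothetical dense sketch of a bidirectional voxel-pillar fusion step.
# The paper's SFL operates on sparse features; this illustrates only the
# exchange pattern, under the assumptions stated above.
import torch
import torch.nn as nn

class VoxelPillarFusion(nn.Module):
    """Exchange features between a voxel stream (B, C, D, H, W) and a
    pillar stream (B, C, H, W) that share the same BEV grid."""
    def __init__(self, channels: int):
        super().__init__()
        # Project pooled voxel features before adding them to the pillar stream.
        self.voxel_to_pillar = nn.Conv2d(channels, channels, kernel_size=1)
        # Project pillar features before broadcasting them over the height axis.
        self.pillar_to_voxel = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, voxel_feat, pillar_feat):
        # Voxel -> pillar: collapse the vertical (D) axis by max-pooling.
        bev_from_voxel = voxel_feat.max(dim=2).values           # (B, C, H, W)
        pillar_out = pillar_feat + self.voxel_to_pillar(bev_from_voxel)
        # Pillar -> voxel: broadcast the BEV map along the vertical axis.
        lifted = self.pillar_to_voxel(pillar_feat).unsqueeze(2)  # (B, C, 1, H, W)
        voxel_out = voxel_feat + lifted                          # broadcasts over D
        return voxel_out, pillar_out

# Usage on a toy grid
if __name__ == "__main__":
    fusion = VoxelPillarFusion(channels=64)
    voxels = torch.randn(2, 64, 8, 100, 100)   # (B, C, D, H, W)
    pillars = torch.randn(2, 64, 100, 100)     # (B, C, H, W)
    v, p = fusion(voxels, pillars)
    print(v.shape, p.shape)
```

The design choice this illustrates is that each stream keeps its native resolution: the voxel branch retains vertical structure for height-sensitive classes, while the pillar branch contributes BEV-level context, and the residual additions let either stream ignore the other when its own features suffice.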