Efficient point cloud representation is a fundamental element of LiDAR-based 3D object detection. Recent grid-based detectors usually divide point clouds into voxels or pillars and construct single-stream networks in the bird's eye view (BEV). However, these point cloud encoding paradigms underexploit the point representation along the vertical direction, which causes the loss of semantic or fine-grained information, especially for vertically sensitive objects such as pedestrians and cyclists. In this paper, we propose an explicit vertical multi-scale representation learning framework, VPFusion, that combines the complementary information from both voxel and pillar streams. Specifically, VPFusion first builds upon a sparse voxel-pillar backbone, which partitions the point cloud into voxels and pillars and encodes their features with 3D and 2D sparse convolutions simultaneously. Next, we introduce the Sparse Fusion Layer (SFL), which establishes a bidirectional pathway between sparse voxel and pillar features, enabling interaction between the two streams. Additionally, we present the Dense Fusion Neck (DFN) to effectively combine the multi-scale dense feature maps from the voxel and pillar branches. Extensive experiments on the large-scale Waymo Open Dataset and nuScenes Dataset demonstrate that VPFusion surpasses the single-stream baselines by a large margin and achieves state-of-the-art performance with real-time inference speed.
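To make the bidirectional voxel-pillar exchange concrete, the following is a minimal sketch of the idea behind the SFL, not the paper's implementation: dense tensors stand in for the sparse voxel/pillar features, ordinary 1x1 convolutions stand in for the sparse convolutions, and the module and variable names (`VoxelPillarFusion`, `voxel_to_pillar`, `pillar_to_voxel`) are hypothetical. The sketch assumes the voxel stream collapses its vertical axis by max-pooling before joining the pillar stream, and the pillar stream is broadcast along the vertical axis before joining the voxel stream.

```python
# Hypothetical dense sketch of a bidirectional voxel-pillar fusion step.
# The paper's SFL operates on sparse features; this illustrates only the
# exchange pattern, under the assumptions stated above.
import torch
import torch.nn as nn

class VoxelPillarFusion(nn.Module):
    """Exchange features between a voxel stream (B, C, D, H, W) and a
    pillar stream (B, C, H, W) that share the same BEV grid."""
    def __init__(self, channels: int):
        super().__init__()
        # Project pooled voxel features before adding them to the pillar stream.
        self.voxel_to_pillar = nn.Conv2d(channels, channels, kernel_size=1)
        # Project pillar features before broadcasting them over the height axis.
        self.pillar_to_voxel = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, voxel_feat, pillar_feat):
        # Voxel -> pillar: collapse the vertical (D) axis by max-pooling.
        bev_from_voxel = voxel_feat.max(dim=2).values           # (B, C, H, W)
        pillar_out = pillar_feat + self.voxel_to_pillar(bev_from_voxel)
        # Pillar -> voxel: broadcast the BEV map along the vertical axis.
        lifted = self.pillar_to_voxel(pillar_feat).unsqueeze(2)  # (B, C, 1, H, W)
        voxel_out = voxel_feat + lifted                          # broadcasts over D
        return voxel_out, pillar_out

# Usage on a toy grid
if __name__ == "__main__":
    fusion = VoxelPillarFusion(channels=64)
    voxels = torch.randn(2, 64, 8, 100, 100)   # (B, C, D, H, W)
    pillars = torch.randn(2, 64, 100, 100)     # (B, C, H, W)
    v, p = fusion(voxels, pillars)
    print(v.shape, p.shape)
```

The design choice this illustrates is that each stream keeps its native resolution: the voxel branch retains vertical structure for height-sensitive classes, while the pillar branch contributes BEV-level context, and the residual additions let either stream ignore the other when its own features suffice.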