3D object detectors usually rely on hand-crafted proxies, e.g., anchors or centers, and translate well-studied 2D frameworks to 3D. As a result, sparse voxel features need to be densified and processed by dense prediction heads, which inevitably incurs extra computation. In this paper, we instead propose VoxelNeXt for fully sparse 3D object detection. Our core insight is to predict objects directly from sparse voxel features, without relying on hand-crafted proxies. Our strong sparse convolutional network VoxelNeXt detects and tracks 3D objects entirely through voxel features. It is an elegant and efficient framework, with no need for sparse-to-dense conversion or NMS post-processing. Our method achieves a better speed-accuracy trade-off than other mainstream detectors on the nuScenes dataset. For the first time, we show that a fully sparse voxel-based representation works decently for LiDAR 3D object detection and tracking. Extensive experiments on the nuScenes, Waymo, and Argoverse2 benchmarks validate the effectiveness of our approach. Without bells and whistles, our model outperforms all existing LiDAR methods on the nuScenes tracking test benchmark.
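To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of a fully sparse prediction head: each non-empty voxel directly predicts class scores and a box, and detections are read off the highest-scoring voxels, with no densified feature map and no NMS. The names `VoxelHead`, `num_classes`, `box_dim`, and `top_k` are illustrative assumptions, and the sparse-max-pooling step VoxelNeXt uses in place of NMS is omitted for brevity.

```python
import torch
import torch.nn as nn


class VoxelHead(nn.Module):
    """Per-voxel classification and box regression on sparse voxel features (illustrative sketch)."""

    def __init__(self, in_channels: int, num_classes: int, box_dim: int = 7):
        super().__init__()
        self.cls = nn.Linear(in_channels, num_classes)  # per-voxel class logits
        self.reg = nn.Linear(in_channels, box_dim)      # per-voxel box parameters

    def forward(self, voxel_feats: torch.Tensor, voxel_coords: torch.Tensor, top_k: int = 50):
        # voxel_feats:  (N, C) features of the N non-empty voxels
        # voxel_coords: (N, 3) integer voxel indices (z, y, x)
        scores = self.cls(voxel_feats).sigmoid()        # (N, num_classes)
        boxes = self.reg(voxel_feats)                   # (N, box_dim)

        # Take the highest-scoring (voxel, class) pairs directly as detections;
        # no dense heatmap is built and no NMS is applied.
        flat_scores, flat_idx = scores.flatten().topk(top_k)
        voxel_idx = torch.div(flat_idx, scores.shape[1], rounding_mode="floor")
        class_idx = flat_idx % scores.shape[1]
        return {
            "scores": flat_scores,
            "labels": class_idx,
            "boxes": boxes[voxel_idx],
            "coords": voxel_coords[voxel_idx],
        }


if __name__ == "__main__":
    head = VoxelHead(in_channels=128, num_classes=10)
    feats = torch.randn(2048, 128)             # sparse voxel features from the backbone
    coords = torch.randint(0, 180, (2048, 3))  # their voxel indices
    out = head(feats, coords)
    print(out["boxes"].shape)                  # torch.Size([50, 7])
```

The point of the sketch is the data flow: predictions stay on the sparse set of voxels from input to output, which is what removes the sparse-to-dense conversion and the dense-head computation described above.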