In recent years, transformer-based detectors have demonstrated remarkable performance in 2D visual perception tasks. However, their performance in multi-view 3D object detection remains inferior to that of state-of-the-art (SOTA) detectors based on convolutional neural networks. In this work, we investigate this issue from the perspective of bird's-eye-view (BEV) feature generation. Specifically, we examine the BEV feature generation method employed by the transformer-based SOTA, BEVFormer, and identify two limitations: (i) it generates attention weights only from the BEV, which precludes using LiDAR points for supervision, and (ii) it aggregates camera-view features into the BEV through deformable sampling, which selects only a small subset of features and thus fails to exploit all available information. To overcome these limitations, we propose a novel BEV feature generation method, dual-view attention, which generates attention weights from both the BEV and the camera view and encodes all camera features into the BEV feature. By combining dual-view attention with the BEVFormer architecture, we build a new detector named VoxelFormer. Extensive experiments on the nuScenes benchmark verify the superiority of dual-view attention and VoxelFormer. We observe that even when adopting only 3 encoders and 1 historical frame during training, VoxelFormer still outperforms BEVFormer significantly. When trained in the same setting, VoxelFormer surpasses BEVFormer by 4.9% NDS. Code is available at: https://github.com/Lizhuoling/VoxelFormer-public.git.
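To make the mechanism concrete, below is a minimal, conceptual sketch of dual-view attention as characterized above: attention weights are predicted from both the BEV and camera views, and every camera feature (rather than a sparse deformable sample) is aggregated into its BEV cell. This is not the authors' implementation; the module and tensor names (`DualViewAttention`, `bev_weight_head`, `cam_weight_head`, `cam2bev_index`) are hypothetical, and the scatter-based aggregation stands in for the real camera-to-BEV projection.

```python
# Conceptual sketch only; assumes camera features have already been projected
# so that each one is assigned to a BEV cell via a precomputed index.
import torch
import torch.nn as nn

class DualViewAttention(nn.Module):
    def __init__(self, embed_dim: int):
        super().__init__()
        # Weight branch predicted from the BEV side (one weight per BEV cell).
        self.bev_weight_head = nn.Linear(embed_dim, 1)
        # Weight branch predicted from the camera side (one weight per image
        # feature); per the abstract, this is the branch that opens the door
        # to supervision with LiDAR points.
        self.cam_weight_head = nn.Linear(embed_dim, 1)

    def forward(self, bev_query, cam_feat, cam2bev_index):
        # bev_query:     (num_bev_cells, C)  current BEV features
        # cam_feat:      (num_cam_feats, C)  flattened multi-view camera features
        # cam2bev_index: (num_cam_feats,)    BEV cell each camera feature maps to
        w_bev = self.bev_weight_head(bev_query)[cam2bev_index]  # (num_cam_feats, 1)
        w_cam = self.cam_weight_head(cam_feat)                  # (num_cam_feats, 1)
        weight = torch.sigmoid(w_bev + w_cam)                   # dual-view weight
        # Aggregate ALL camera features into the BEV (no sparse sampling).
        out = torch.zeros_like(bev_query)
        out.index_add_(0, cam2bev_index, weight * cam_feat)
        return out
```

The key contrast with deformable sampling is in the last two lines: instead of gathering a handful of sampled locations per query, every camera feature contributes to the BEV, gated by a weight that depends on both views.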