The transformation of features from 2D perspective space into 3D space is essential for multi-view 3D object detection. Recent approaches mainly focus on the design of the view transformation, either lifting perspective-view features into 3D space pixel-wise with estimated depth or constructing bird's-eye-view (BEV) features grid-wise via 3D projection, treating all pixels or grids equally. However, choosing what to transform is also important but has rarely been discussed. The pixels of a moving car, for example, are more informative than those of the sky. To fully exploit the information contained in images, the view transformation should adapt to different image regions according to their content. In this paper, we propose a novel framework named FrustumFormer, which pays more attention to features in instance regions via adaptive instance-aware resampling. Specifically, the model obtains instance frustums on the bird's-eye view by leveraging image-view object proposals. An adaptive occupancy mask within each instance frustum is learned to refine the instance location. Moreover, intersecting instance frustums across time further reduces the localization uncertainty of objects. Comprehensive experiments on the nuScenes dataset demonstrate the effectiveness of FrustumFormer, which achieves new state-of-the-art performance on the benchmark. Code and models will be made available at https://github.com/Robertwyq/Frustum.
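To make the core idea of instance frustums concrete, below is a minimal sketch (not the authors' released code) of lifting one 2D detection box into a BEV instance frustum: the box corners are unprojected at a coarse set of depth hypotheses and their footprint is rasterized onto the BEV grid. The function name `box_to_bev_frustum` and the grid parameters are illustrative assumptions; the learned occupancy mask described in the abstract would then operate within this region.

```python
# Sketch of lifting a 2D image-view proposal into a BEV instance frustum.
# All names and parameters here are illustrative assumptions, not the paper's API.
import numpy as np

def box_to_bev_frustum(box2d, K, cam2ego, depths, bev_range=51.2, bev_cells=128):
    """Return a boolean BEV mask covering the frustum of one 2D proposal.

    box2d   : (x1, y1, x2, y2) pixel coordinates of the image-view proposal.
    K       : 3x3 camera intrinsics.
    cam2ego : 4x4 camera-to-ego transform.
    depths  : 1D array of depth hypotheses in meters, e.g. np.linspace(1, 60, 32).
    """
    x1, y1, x2, y2 = box2d
    # Box corners in homogeneous image coordinates.
    corners = np.array([[x1, y1, 1.0], [x2, y1, 1.0],
                        [x1, y2, 1.0], [x2, y2, 1.0]])           # (4, 3)
    rays = corners @ np.linalg.inv(K).T                          # (4, 3) camera rays, z = 1
    # Scale each ray by every depth hypothesis -> 3D points in the camera frame.
    pts_cam = (rays[None, :, :] * depths[:, None, None]).reshape(-1, 3)
    pts_hom = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    pts_ego = (pts_hom @ cam2ego.T)[:, :3]                       # points in the ego frame
    # Rasterize the xy footprint onto an ego-centered square BEV grid.
    res = 2 * bev_range / bev_cells
    ij = np.floor((pts_ego[:, :2] + bev_range) / res).astype(int)
    mask = np.zeros((bev_cells, bev_cells), dtype=bool)
    valid = ((ij >= 0) & (ij < bev_cells)).all(axis=1)
    mask[ij[valid, 1], ij[valid, 0]] = True
    return mask
```

Under these assumptions, the union of such masks over all proposals marks the instance regions where resampling should concentrate, while the remaining BEV cells correspond to background (e.g., sky or road) that can be transformed more coarsely.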