The transformation of features from 2D perspective space to 3D space is essential to multi-view 3D object detection. Recent approaches mainly focus on the design of the view transformation, either lifting perspective-view features pixel-wise into 3D space with estimated depth or constructing BEV features grid-wise via 3D projection, treating all pixels or grids equally. However, choosing what to transform is also important but has rarely been discussed before. For example, the pixels of a moving car are more informative than the pixels of the sky. To fully utilize the information contained in images, the view transformation should be able to adapt to different image regions according to their contents. In this paper, we propose a novel framework named FrustumFormer, which pays more attention to the features in instance regions via adaptive instance-aware resampling. Specifically, the model obtains instance frustums in bird's-eye view by leveraging image-view object proposals. An adaptive occupancy mask within the instance frustum is learned to refine the instance location. Moreover, the temporal frustum intersection can further reduce the localization uncertainty of objects. Comprehensive experiments on the nuScenes dataset demonstrate the effectiveness of FrustumFormer, and we achieve a new state-of-the-art performance on the benchmark. Code will be released soon.
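To make the idea of an instance frustum concrete, the sketch below back-projects a 2D detection box through a range of hypothesized depths into the bird's-eye-view plane using pinhole camera intrinsics. This is a minimal illustration of the geometry only, not the paper's implementation: the function name `instance_frustum_bev`, the sampling resolution, and the axis-aligned BEV footprint are all our assumptions for exposition.

```python
import numpy as np

def instance_frustum_bev(box_2d, depth_range, K, num_depths=8, num_samples=4):
    """Illustrative sketch (not FrustumFormer's actual code): lift a 2D image
    box into a BEV 'instance frustum' by back-projecting sampled box pixels
    at a set of hypothesized depths through the intrinsics K.

    box_2d: (u0, v0, u1, v1) pixel corners of the image-view proposal.
    depth_range: (d_min, d_max) depth hypotheses in meters.
    K: 3x3 pinhole intrinsic matrix.
    Returns the axis-aligned BEV footprint (x_min, x_max, z_min, z_max),
    with x pointing right and z forward in the camera frame.
    """
    u0, v0, u1, v1 = box_2d
    uu, vv = np.meshgrid(np.linspace(u0, u1, num_samples),
                         np.linspace(v0, v1, num_samples))
    # Homogeneous pixel coordinates, shape (3, N).
    pix = np.stack([uu.ravel(), vv.ravel(), np.ones(uu.size)])
    # Camera-frame viewing rays (unit depth), shape (3, N).
    rays = np.linalg.inv(K) @ pix
    # Scale each ray by every hypothesized depth: (3, N, D) -> (N*D, 3).
    depths = np.linspace(depth_range[0], depth_range[1], num_depths)
    pts = (rays[:, :, None] * depths[None, None, :]).reshape(3, -1).T
    x, z = pts[:, 0], pts[:, 2]
    return x.min(), x.max(), z.min(), z.max()
```

Because the rays fan out with depth, the resulting BEV footprint is a wedge that widens as the depth hypothesis grows; in the paper, an adaptive occupancy mask learned within this frustum then narrows the search to where the instance actually lies.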