The recent trend in multi-camera 3D object detection is toward the unified bird's-eye view (BEV) representation. However, directly transforming features extracted from the image-plane view to BEV inevitably causes feature distortion, especially around the objects of interest, blurring the objects into the background. To this end, we propose OA-BEV, a network that can be plugged into BEV-based 3D object detection frameworks to bring out the objects by incorporating object-aware pseudo-3D features and depth features. Such features carry information about the objects' positions and 3D structures. First, we explicitly guide the network to learn the depth distribution with object-level supervision derived from each 3D object's center. Then, we select the foreground pixels with a 2D object detector and project them into 3D space for pseudo-voxel feature encoding. Finally, the object-aware depth features and pseudo-voxel features are incorporated into the BEV representation via a deformable attention mechanism. We conduct extensive experiments on the nuScenes dataset to validate the merits of the proposed OA-BEV. Our method achieves consistent improvements over BEV-based baselines in terms of both average precision and nuScenes detection score. Our code will be released.
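The pseudo-voxel step described above — projecting detected foreground pixels into 3D space using predicted depth, then encoding them on a voxel grid — can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the function name, the simple occupancy-count encoding, and the single-camera pinhole setup are assumptions made for clarity.

```python
# Hypothetical sketch (not the authors' code): lift foreground image pixels
# to 3D pseudo-points with predicted depth, then scatter them into a coarse
# voxel grid -- a minimal version of the "pseudo-voxel" idea.
import numpy as np

def lift_pixels_to_voxels(pixels_uv, depths, K, grid_min, voxel_size, grid_shape):
    """pixels_uv: (N, 2) pixel coordinates of foreground pixels;
    depths: (N,) predicted metric depths; K: (3, 3) camera intrinsics.
    Returns an occupancy-count grid of shape grid_shape."""
    # Back-project each pixel: X_cam = depth * K^-1 @ [u, v, 1]^T
    ones = np.ones((pixels_uv.shape[0], 1))
    rays = np.linalg.inv(K) @ np.hstack([pixels_uv, ones]).T   # (3, N)
    pts = (rays * depths).T                                    # (N, 3) camera-frame points
    # Voxelize: index each point into the grid, drop out-of-range points.
    idx = np.floor((pts - grid_min) / voxel_size).astype(int)
    in_range = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    grid = np.zeros(grid_shape, dtype=np.int32)
    np.add.at(grid, tuple(idx[in_range].T), 1)  # accumulate point counts
    return grid

# Toy example: two foreground pixels at 10 m and 12 m predicted depth.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
uv = np.array([[320.0, 240.0], [400.0, 240.0]])
grid = lift_pixels_to_voxels(uv, np.array([10.0, 12.0]), K,
                             grid_min=np.array([-20.0, -20.0, 0.0]),
                             voxel_size=1.0, grid_shape=(40, 40, 40))
print(int(grid.sum()))  # prints 2: both points land inside the grid
```

In the full method, such per-voxel features would be produced by a learned encoder rather than raw counts, and fused into the BEV representation alongside the depth features via deformable attention.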