Modern methods for vision-centric autonomous driving perception widely adopt the bird's-eye-view (BEV) representation to describe a 3D scene. Despite being more efficient than a voxel representation, it has difficulty describing the fine-grained 3D structure of a scene with a single plane. To address this, we propose a tri-perspective view (TPV) representation which accompanies BEV with two additional perpendicular planes. We model each point in the 3D space by summing its projected features on the three planes. To lift image features to the 3D TPV space, we further propose a transformer-based TPV encoder (TPVFormer) to obtain the TPV features effectively. We employ the attention mechanism to aggregate the image features corresponding to each query in each TPV plane. Experiments show that our model trained with sparse supervision effectively predicts the semantic occupancy for all voxels. We demonstrate for the first time that using only camera inputs can achieve performance comparable to LiDAR-based methods on the LiDAR segmentation task on nuScenes. Code: https://github.com/wzzheng/TPVFormer.
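To make the point-feature construction concrete, the following is a minimal sketch (not the authors' implementation) of how a 3D point's feature can be obtained by projecting it onto three perpendicular planes, bilinearly sampling each plane, and summing the results. The plane resolutions, channel size, scene range, and helper names (`sample_plane`, `tpv_point_features`) are illustrative assumptions, not taken from the paper or its code release.

```python
# Illustrative sketch of the TPV point query: project a 3D point onto the
# top (x, y), side (y, z), and front (z, x) planes, sample each plane
# bilinearly, and sum the three sampled features.
import torch
import torch.nn.functional as F


def sample_plane(plane: torch.Tensor, uv: torch.Tensor) -> torch.Tensor:
    """Bilinearly sample a (C, H, W) feature plane at normalized coords uv in [-1, 1]."""
    # grid_sample expects input (N, C, H, W) and grid (N, H_out, W_out, 2)
    grid = uv.view(1, -1, 1, 2)                                # (1, P, 1, 2)
    feats = F.grid_sample(plane.unsqueeze(0), grid,
                          mode="bilinear", align_corners=False)
    return feats.squeeze(0).squeeze(-1).t()                    # (P, C)


def tpv_point_features(points, tpv_hw, tpv_dh, tpv_wd, scene_range=50.0):
    """Sum the features sampled from the three TPV planes for each 3D point."""
    xyz = points / scene_range                                 # normalize to roughly [-1, 1]
    x, y, z = xyz.unbind(dim=-1)
    f_hw = sample_plane(tpv_hw, torch.stack([x, y], dim=-1))   # top view
    f_dh = sample_plane(tpv_dh, torch.stack([y, z], dim=-1))   # side view
    f_wd = sample_plane(tpv_wd, torch.stack([z, x], dim=-1))   # front view
    return f_hw + f_dh + f_wd                                  # (P, C)


# Toy usage: 4 random points queried against random 64-channel TPV planes.
planes = [torch.randn(64, 128, 128) for _ in range(3)]
pts = torch.rand(4, 3) * 100.0 - 50.0                          # points in a 100 m cube
print(tpv_point_features(pts, *planes).shape)                  # torch.Size([4, 64])
```

Because the summation is defined per point rather than per voxel, the same query can be issued at arbitrary 3D locations, which is what allows the model to be supervised sparsely yet predict occupancy for all voxels at inference.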