3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design a spatial cross-attention in which each BEV query extracts spatial features from regions of interest across camera views. For temporal information, we propose a temporal self-attention to recurrently fuse historical BEV information. Our approach achieves a new state of the art of 56.9\% in terms of the NDS metric on the nuScenes \texttt{test} set, which is 9.0 points higher than the previous best results and on par with the performance of LiDAR-based baselines. We further show that BEVFormer remarkably improves velocity estimation accuracy and the recall of objects under low-visibility conditions. The code is available at \url{https://github.com/zhiqi-li/BEVFormer}.
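To make the described data flow concrete, the following is a minimal, hypothetical PyTorch-style sketch, not the official implementation available in the repository above. It only illustrates the overall structure named in the abstract: predefined grid-shaped BEV queries, a temporal self-attention that attends to the previous frame's BEV, and a spatial cross-attention over multi-camera features. All module names, shapes, and hyperparameters are illustrative assumptions, and full multi-head attention is used here in place of BEVFormer's deformable attention around projected reference points.

\begin{verbatim}
# Illustrative sketch only; not the official BEVFormer code.
import torch
import torch.nn as nn


class BEVFormerSketch(nn.Module):
    def __init__(self, bev_h=50, bev_w=50, dim=256, num_heads=8):
        super().__init__()
        self.bev_h, self.bev_w = bev_h, bev_w
        # Predefined grid-shaped BEV queries: one learnable embedding per BEV cell.
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, dim))
        # Temporal self-attention: current queries attend to the previous BEV.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Spatial cross-attention: queries attend to multi-camera image features
        # (simplified to full attention; the real model restricts each query to
        # regions of interest via deformable attention at projected points).
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, cam_feats, prev_bev=None):
        # cam_feats: (B, num_cams * tokens_per_cam, dim) flattened image features
        B = cam_feats.size(0)
        q = self.bev_queries.unsqueeze(0).expand(B, -1, -1)
        if prev_bev is not None:
            # Recurrently fuse historical BEV information.
            q, _ = self.temporal_attn(q, prev_bev, prev_bev)
        # Aggregate spatial information from the camera views.
        bev, _ = self.spatial_attn(q, cam_feats, cam_feats)
        return bev  # (B, bev_h * bev_w, dim), fed to detection / segmentation heads


if __name__ == "__main__":
    model = BEVFormerSketch()
    feats_t0 = torch.randn(2, 6 * 300, 256)    # 6 cameras, 300 tokens each (made up)
    feats_t1 = torch.randn(2, 6 * 300, 256)
    bev_t0 = model(feats_t0)                   # first frame: no history available
    bev_t1 = model(feats_t1, prev_bev=bev_t0)  # later frames reuse the previous BEV
    print(bev_t1.shape)                        # torch.Size([2, 2500, 256])
\end{verbatim}

The recurrent call pattern in the usage example mirrors the temporal design described above: the BEV produced at one timestamp is passed back in as the key/value memory for the next, so temporal cues such as object velocity can be accumulated without stacking raw image features over time.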