In this paper, we present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems. Unlike existing studies that focus on improving single-task approaches, BEVerse is distinguished by producing spatio-temporal Bird's-Eye-View (BEV) representations from multi-camera videos and jointly reasoning about multiple tasks for vision-centric autonomous driving. Specifically, BEVerse first performs shared feature extraction and lifting to generate 4D BEV representations from multi-timestamp and multi-view images. After ego-motion alignment, a spatio-temporal encoder is utilized for further feature extraction in BEV. Finally, multiple task decoders are attached for joint reasoning and prediction. Within the decoders, we propose the grid sampler to generate BEV features with different ranges and granularities for different tasks. We also design an iterative-flow method for memory-efficient future prediction. We show that temporal information improves 3D object detection and semantic map construction, while multi-task learning can implicitly benefit motion prediction. With extensive experiments on the nuScenes dataset, we show that the multi-task BEVerse outperforms existing single-task methods on 3D object detection, semantic map construction, and motion prediction. Compared with the sequential paradigm, BEVerse also offers significantly improved efficiency. The code and trained models will be released at https://github.com/zhangyp15/BEVerse.
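The pipeline described above (lifting to BEV, ego-motion alignment, spatio-temporal encoding, and per-task grid sampling) can be sketched at a high level as follows. This is a minimal illustrative skeleton, not the authors' implementation: all function names, shapes, and the nearest-neighbor resampling stand-in are assumptions for exposition.

```python
import numpy as np

def lift_to_bev(images, bev_hw=(200, 200), channels=64):
    """Shared feature extraction + lifting: one BEV feature grid per
    timestamp. Placeholder for the real depth-based view transformer;
    here we simply emit zeroed grids of the right shape."""
    n_time = images.shape[0]
    return np.zeros((n_time, channels, *bev_hw), dtype=np.float32)

def align_ego_motion(bev_seq, ego_shifts):
    """Warp past BEV frames into the present ego frame. A simple
    integer translation stands in for the actual warping."""
    aligned = np.empty_like(bev_seq)
    for i, (dx, dy) in enumerate(ego_shifts):
        aligned[i] = np.roll(bev_seq[i], shift=(dy, dx), axis=(1, 2))
    return aligned

def spatiotemporal_encode(aligned):
    """Fuse the time axis into a single present-frame BEV feature
    (temporal mean as a stand-in for the learned encoder)."""
    return aligned.mean(axis=0)

def grid_sample_for_task(bev, out_hw):
    """Grid sampler: resample BEV features to the range/granularity a
    given task decoder expects (nearest-neighbor stand-in)."""
    _, h, w = bev.shape
    ys = np.linspace(0, h - 1, out_hw[0]).astype(int)
    xs = np.linspace(0, w - 1, out_hw[1]).astype(int)
    return bev[:, ys][:, :, xs]

if __name__ == "__main__":
    # Toy forward pass over 3 timestamps of 6-camera input.
    images = np.zeros((3, 6, 3, 224, 480), dtype=np.float32)
    bev_seq = lift_to_bev(images)
    aligned = align_ego_motion(bev_seq, ego_shifts=[(2, 0), (1, 0), (0, 0)])
    fused = spatiotemporal_encode(aligned)
    det_feat = grid_sample_for_task(fused, out_hw=(128, 128))  # detection range
    map_feat = grid_sample_for_task(fused, out_hw=(400, 400))  # finer map grid
    print(fused.shape, det_feat.shape, map_feat.shape)
```

The point of the grid sampler is that each decoder (detection, map construction, motion prediction) can consume the shared BEV feature at its own spatial range and resolution, instead of forcing one grid to serve all tasks.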