Vision-centric joint perception and prediction (PnP) has become an emerging trend in autonomous driving research: it predicts the future states of the traffic participants in the surrounding environment from raw RGB images. However, it remains a critical challenge to synchronize features obtained from multiple camera views and timestamps, owing to inevitable geometric distortions, and to further exploit those spatial-temporal features. To address this issue, we propose a Temporal Bird's-Eye-View Pyramid Transformer (TBP-Former) for vision-centric PnP, which includes two novel designs. First, a pose-synchronized BEV encoder is proposed to map raw image inputs with any camera pose at any time to a shared and synchronized BEV space for better spatial-temporal synchronization. Second, a spatial-temporal pyramid transformer is introduced to comprehensively extract multi-scale BEV features and predict future BEV states with the support of spatial-temporal priors. Extensive experiments on the nuScenes dataset show that our proposed framework overall outperforms all state-of-the-art vision-based prediction methods.
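To make the pose-synchronization idea concrete, below is a minimal PyTorch sketch (not the authors' code) of how BEV features from a past timestamp can be warped into the current ego frame given the known relative ego pose, so that features from all timestamps share one synchronized BEV space. The pose format `(dx, dy, yaw)`, the BEV extent, and the function name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def warp_bev_to_current(bev_feat, rel_pose, bev_range=50.0):
    """Warp a past BEV feature map into the current ego frame.

    bev_feat:  (B, C, H, W) BEV features at a past timestamp.
    rel_pose:  (B, 3) relative ego motion (dx, dy, yaw) from the past
               frame to the current frame, in metres/radians (assumed).
    bev_range: half-extent of the BEV grid in metres (assumed 50 m).
    """
    B, C, H, W = bev_feat.shape
    dx, dy, yaw = rel_pose[:, 0], rel_pose[:, 1], rel_pose[:, 2]
    cos, sin = torch.cos(yaw), torch.sin(yaw)

    # Build 2x3 affine matrices in normalized grid coordinates ([-1, 1]);
    # the exact sign convention depends on the chosen BEV axis layout.
    theta = torch.zeros(B, 2, 3, device=bev_feat.device, dtype=bev_feat.dtype)
    theta[:, 0, 0], theta[:, 0, 1] = cos, -sin
    theta[:, 1, 0], theta[:, 1, 1] = sin, cos
    theta[:, 0, 2] = dx / bev_range  # translation, normalized to grid units
    theta[:, 1, 2] = dy / bev_range

    grid = F.affine_grid(theta, bev_feat.shape, align_corners=False)
    # Cells that fall outside the past frame's grid are zero-padded,
    # since no past observation exists for them.
    return F.grid_sample(bev_feat, grid, align_corners=False)
```

After this warping step, the per-timestamp BEV maps are spatially aligned and can be stacked along the temporal axis, which is the form of input a spatial-temporal transformer would consume to extract multi-scale features and predict future BEV states.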