Vision-centric joint perception and prediction (PnP) has become an emerging trend in autonomous driving research. It predicts the future states of the traffic participants in the surrounding environment from raw RGB images. However, it remains a critical challenge to synchronize features obtained from multiple camera views and timestamps, owing to inevitable geometric distortions, and to further exploit those spatial-temporal features. To address this issue, we propose a temporal bird's-eye-view pyramid transformer (TBP-Former) for vision-centric PnP, which includes two novel designs. First, a pose-synchronized BEV encoder is proposed to map raw image inputs, with any camera pose at any time, to a shared and synchronized BEV space for better spatial-temporal synchronization. Second, a spatial-temporal pyramid transformer is introduced to comprehensively extract multi-scale BEV features and predict future BEV states with the support of spatial-temporal priors. Extensive experiments on the nuScenes dataset show that our proposed framework outperforms all state-of-the-art vision-based prediction methods overall.
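The two-stage pipeline described above can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: all function names, shapes, and the placeholder computations are assumptions introduced here to show how the pose-synchronized BEV encoder feeds the spatial-temporal pyramid transformer.

```python
# Hypothetical sketch of the TBP-Former data flow; names and shapes are
# illustrative assumptions, not the paper's actual API or architecture.

def sync_bev_encode(images, poses, bev_size=(200, 200)):
    """Stand-in for the pose-synchronized BEV encoder: each timestamp's
    multi-view images would be lifted and warped by their camera poses
    into one shared, ego-synchronized BEV grid."""
    h, w = bev_size
    # One synchronized BEV feature map per timestamp (zeros as placeholders).
    return [[[0.0] * w for _ in range(h)] for _ in images]

def pyramid_transform(bev_history, horizon=4):
    """Stand-in for the spatial-temporal pyramid transformer: consumes the
    synchronized BEV history and emits future BEV state maps."""
    last = bev_history[-1]
    # Placeholder: repeat the latest BEV state for each future step.
    return [last for _ in range(horizon)]

# Toy inputs: 3 past timestamps, each with 6 surround-view images and poses.
images = [["img"] * 6 for _ in range(3)]
poses = [["pose"] * 6 for _ in range(3)]

bev_history = sync_bev_encode(images, poses)
future_bev = pyramid_transform(bev_history, horizon=4)
print(len(bev_history), len(future_bev))  # prints: 3 4
```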