Many existing autonomous driving paradigms involve a multi-stage discrete pipeline of tasks. To better predict the control signals and enhance user safety, an end-to-end approach that benefits from joint spatial-temporal feature learning is desirable. While there are some pioneering works on LiDAR-based input or implicit design, in this paper we formulate the problem in an interpretable vision-based setting. In particular, we propose a spatial-temporal feature learning scheme towards a set of more representative features for perception, prediction and planning tasks simultaneously, which is called ST-P3. Specifically, an egocentric-aligned accumulation technique is proposed to preserve geometry information in 3D space before the bird's eye view transformation for perception; a dual pathway modeling is devised to take past motion variations into account for future prediction; a temporal-based refinement unit is introduced to compensate for recognizing vision-based elements for planning. To the best of our knowledge, we are the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system. We benchmark our approach against previous state-of-the-art methods on both the open-loop nuScenes dataset and the closed-loop CARLA simulation. The results show the effectiveness of our method. Source code, models and protocol details are made publicly available at https://github.com/OpenPerceptionX/ST-P3.
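To make the three stages described above concrete, the following is a minimal, illustrative sketch of how a perception-prediction-planning pipeline of this kind can be wired together. It is not the official ST-P3 implementation (see the repository above for that); the module names, tensor shapes, GRU-based dual pathway and linear refinement head are all simplifying assumptions chosen only to mirror the structure named in the abstract.

```python
# Illustrative-only sketch (not the official ST-P3 code). All shapes, module
# names and the GRU/linear blocks are assumptions mirroring the three stages:
# perception (egocentric-aligned BEV accumulation), prediction (dual-pathway
# future modeling) and planning (temporal refinement of coarse waypoints).
import torch
import torch.nn as nn


class STP3Sketch(nn.Module):
    def __init__(self, bev_channels=64, horizon=4):
        super().__init__()
        self.horizon = horizon
        # Perception: fuse the history of BEV features already aligned to the ego frame.
        self.bev_fuse = nn.Conv2d(bev_channels, bev_channels, 3, padding=1)
        # Prediction: two pathways -- one driven by past dynamics, one by a latent sample.
        self.gru_history = nn.GRUCell(bev_channels, bev_channels)
        self.gru_latent = nn.GRUCell(bev_channels, bev_channels)
        # Planning: refine each coarse waypoint with its predicted future context.
        self.refine = nn.Linear(bev_channels + 2, 2)

    def forward(self, bev_history, coarse_traj):
        # bev_history: (T, C, H, W) past BEV features warped into the current
        # ego frame (the "egocentric-aligned accumulation" step).
        fused = self.bev_fuse(bev_history.sum(dim=0, keepdim=True))  # (1, C, H, W)
        state = fused.mean(dim=(2, 3))                               # (1, C)

        # Dual-pathway rollout of future states over the planning horizon.
        futures = []
        h_hist, h_lat = state, torch.randn_like(state)
        for _ in range(self.horizon):
            h_hist = self.gru_history(state, h_hist)
            h_lat = self.gru_latent(state, h_lat)
            futures.append(h_hist + h_lat)

        # Temporal refinement: adjust each coarse (x, y) waypoint with its context.
        refined = [
            self.refine(torch.cat([f, w.unsqueeze(0)], dim=-1))
            for f, w in zip(futures, coarse_traj)
        ]
        return torch.cat(refined, dim=0)  # (horizon, 2) refined waypoints


if __name__ == "__main__":
    model = STP3Sketch()
    bev_hist = torch.randn(3, 64, 32, 32)   # three past BEV frames (toy sizes)
    coarse = torch.randn(4, 2)              # coarse future waypoints
    print(model(bev_hist, coarse).shape)    # torch.Size([4, 2])
```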