Humans drive vehicles efficiently by relying on contextual and spatial information gathered through their sensory organs. Inspired by this, much research has focused on learning robust and efficient driving policies. These works are broadly categorized into modular and end-to-end systems for learning driving policies. However, the former approach is limited by the manual supervision required for specific modules, which hinders the scalability of these systems. In this work, we focus on the latter approach and formalize a framework for learning driving policies for end-to-end autonomous driving. Drawing inspiration from human driving, we propose a framework that incorporates three RGB cameras (left, right, and center) to mimic the human field of view, together with top-down semantic information for contextual representation, to predict driving policies. The sensor information is fused and encoded by a self-attention mechanism, followed by an auto-regressive waypoint prediction module. The proposed method's efficacy is experimentally evaluated using the CARLA simulator, where it outperforms state-of-the-art methods by achieving the highest driving score at evaluation time.
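To make the described pipeline concrete, the following is a minimal PyTorch sketch of such an architecture: per-sensor feature encoding, self-attention fusion of the three camera views and the top-down semantic map, and a GRU-based auto-regressive waypoint decoder. All layer sizes, the shared CNN encoder, the six-class semantic input, and the GRU decoder are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class SensorFusionPolicy(nn.Module):
    """Sketch of the described pipeline: encode each sensor into a token,
    fuse the tokens with self-attention, then decode waypoints
    auto-regressively. Dimensions and layers are assumptions."""

    def __init__(self, feat_dim=256, num_heads=8, num_waypoints=4):
        super().__init__()
        # One CNN encoder shared by the three RGB views (left, center, right).
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=4), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Separate encoder for the top-down semantic map
        # (assumed one-hot with 6 semantic classes).
        self.sem_encoder = nn.Sequential(
            nn.Conv2d(6, 64, 5, stride=4), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Self-attention over the four sensor tokens fuses the modalities.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True),
            num_layers=2,
        )
        # GRU cell decodes waypoints one step at a time, each step
        # conditioned on the fused context and the previous waypoint.
        self.gru = nn.GRUCell(input_size=2, hidden_size=feat_dim)
        self.wp_head = nn.Linear(feat_dim, 2)
        self.num_waypoints = num_waypoints

    def forward(self, left, center, right, sem_map):
        # Encode each sensor into a single token of size feat_dim.
        tokens = torch.stack(
            [self.rgb_encoder(left), self.rgb_encoder(center),
             self.rgb_encoder(right), self.sem_encoder(sem_map)], dim=1)
        fused = self.fusion(tokens).mean(dim=1)  # (B, feat_dim)

        # Auto-regressive decoding: start at the ego position (0, 0)
        # and predict a positional delta per step.
        wp = torch.zeros(fused.size(0), 2, device=fused.device)
        hidden, waypoints = fused, []
        for _ in range(self.num_waypoints):
            hidden = self.gru(wp, hidden)
            wp = wp + self.wp_head(hidden)
            waypoints.append(wp)
        return torch.stack(waypoints, dim=1)  # (B, num_waypoints, 2)
```

Under these assumptions, a forward pass on three camera tensors of shape `(B, 3, 256, 256)` and a semantic map of shape `(B, 6, 256, 256)` yields `num_waypoints` future 2D waypoints per sample, which a downstream controller would convert into steering and throttle commands.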