In end-to-end driving, a large number of expert driving demonstrations is used to train an agent that mimics the expert by predicting its control actions. This process is self-supervised on vehicle signals (e.g., steering angle, acceleration) and does not require extra costly supervision (human labeling). Yet, improvements to existing self-supervised end-to-end driving models have mostly given way to modular end-to-end models, which require labeling-intensive data formats such as semantic segmentation during training. However, we argue that the latest self-supervised end-to-end models were developed under sub-optimal conditions, with low-resolution images and no attention mechanisms. Further, those models are confined to a limited field of view, far from human visual cognition, which can quickly attend to far-apart scene features, a trait that provides a useful inductive bias. In this context, we present a new end-to-end model, trained by self-supervised imitation learning, that leverages a large field of view and a self-attention mechanism. These settings contribute more to the agent's understanding of the driving scene, which leads to a better imitation of human drivers. With only self-supervised training data, our model achieves near-expert performance on CARLA's NoCrash metrics and rivals SOTA models that require large amounts of human-labeled data. To facilitate further research, our code will be released.
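To make the training signal concrete, below is a minimal behavior-cloning sketch in PyTorch. It is not the authors' architecture: the class name AttentionDrivingAgent, the layer sizes, and the wide field-of-view image resolution are illustrative assumptions. It only shows the principle described above, namely that a self-attention layer over image tokens can be supervised directly by logged vehicle signals (steering, acceleration) without any human labeling.

```python
# Hypothetical sketch of self-supervised imitation learning with self-attention
# over a wide field-of-view image; names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionDrivingAgent(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        # Convolutional backbone that turns the wide-FOV image into a grid of tokens.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, embed_dim, 3, stride=4, padding=1), nn.ReLU(),
        )
        # Self-attention lets every token attend to far-apart scene features.
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Regression head predicting the expert's control actions.
        self.head = nn.Linear(embed_dim, 2)  # [steering, acceleration]

    def forward(self, wide_fov_image):
        feats = self.backbone(wide_fov_image)            # (B, C, H, W)
        tokens = feats.flatten(2).transpose(1, 2)        # (B, H*W, C)
        attended, _ = self.attn(tokens, tokens, tokens)  # global self-attention
        return self.head(attended.mean(dim=1))           # (B, 2) control actions

# One imitation-learning step: supervision comes only from recorded vehicle signals.
agent = AttentionDrivingAgent()
images = torch.randn(4, 3, 160, 480)   # hypothetical wide-FOV RGB crops
expert_actions = torch.randn(4, 2)     # logged steering and acceleration
loss = nn.functional.l1_loss(agent(images), expert_actions)
loss.backward()
```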