In this work, we present the first offline policy gradient method for learning imitative policies for complex urban driving from a large corpus of real-world demonstrations. We achieve this by building a differentiable, data-driven simulator on top of perception outputs and high-fidelity HD maps of the area. The simulator lets us synthesize new driving experiences from existing demonstrations using mid-level representations. Using this simulator, we then train a policy network in closed loop with policy gradients. We train the proposed method on 100 hours of expert demonstrations on urban roads and show that it learns complex driving policies that generalize well and can perform a variety of driving maneuvers. We demonstrate this in simulation and by deploying our model to self-driving vehicles in the real world. Our method outperforms the previously demonstrated state of the art for urban driving scenarios, without the need for complex state perturbations or for collecting additional on-policy data during training. We make code and data publicly available.
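To make the closed-loop idea in the abstract concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a one-dimensional state, a logged expert trajectory, a single-parameter policy, and a trivially differentiable simulator step `x' = x + a`; the names `rollout`, `train`, `theta`, and `lr` are hypothetical and chosen only for illustration. The policy is unrolled on its own simulated states (closed loop), the imitation loss is accumulated against the logged demonstration, and the gradient is propagated through the simulator step analytically. No extra on-policy data is collected; only the fixed log is used.

```python
def rollout(theta, expert):
    """Unroll the policy in closed loop and return (imitation loss, dloss/dtheta).

    Toy assumptions (not from the paper): scalar state, policy
    a = theta * (next_expert_state - x), simulator x' = x + a.
    """
    x = expert[0]   # start from the logged initial state
    dx = 0.0        # sensitivity dx/dtheta, carried through the rollout
    loss = 0.0
    grad = 0.0
    for target in expert[1:]:
        gap = target - x
        # Differentiate x' = x + theta * (target - x) with respect to theta.
        dx = dx + gap - theta * dx
        # Simulator step driven by the policy's own previous state.
        x = x + theta * gap
        err = x - target
        loss += err ** 2
        grad += 2.0 * err * dx
    return loss, grad


def train(expert, theta=0.5, lr=0.02, steps=200):
    """Plain gradient descent on the closed-loop imitation loss."""
    history = []
    for _ in range(steps):
        loss, grad = rollout(theta, expert)
        history.append(loss)
        theta -= lr * grad
    return theta, history


if __name__ == "__main__":
    # Hypothetical logged demonstration: constant-velocity motion.
    expert = [0.5 * t for t in range(11)]
    theta, history = train(expert)
    print("learned gain:", theta)
    print("initial loss:", history[0], "final loss:", history[-1])
```

In this toy setting the loss is zero when the gain equals 1, so the learned `theta` should approach 1. A realistic version would replace the scalar state with mid-level scene representations and the hand-derived sensitivity with automatic differentiation.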