We learn an interactive vision-based driving policy from pre-recorded driving logs via a model-based approach. A forward model of the world supervises a driving policy that predicts the outcome of any potential driving trajectory. To support learning from pre-recorded logs, we assume that the world is on rails, meaning neither the agent nor its actions influence the environment. This assumption greatly simplifies the learning problem, factorizing the dynamics into a nonreactive world model and a low-dimensional and compact forward model of the ego-vehicle. Our approach computes action-values for each training trajectory using a tabular dynamic-programming evaluation of the Bellman equations; these action-values in turn supervise the final vision-based driving policy. Despite the world-on-rails assumption, the final driving policy acts well in a dynamic and reactive world. Our method ranks first on the CARLA leaderboard, attaining a 25% higher driving score while using 40 times less data. Our method is also an order of magnitude more sample-efficient than state-of-the-art model-free reinforcement learning techniques on navigational tasks in the ProcGen benchmark.
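The tabular dynamic-programming evaluation of the Bellman equations mentioned above can be sketched as a backward pass over a discretized ego-state grid. This is a minimal illustration, not the paper's implementation: the grid sizes, the horizon, the reward table, and the deterministic ego forward model `next_state` are all hypothetical placeholders (here filled with random values), assumed rather than taken from the abstract.

```python
import numpy as np

# Hypothetical discretization: a small ego-state grid, a few actions,
# and a short horizon. Real values would come from the ego forward model.
n_states, n_actions, T = 5, 3, 4
gamma = 0.9  # discount factor (illustrative)

rng = np.random.default_rng(0)
# reward[t, s, a]: per-timestep reward along a recorded trajectory (assumed given)
reward = rng.uniform(size=(T, n_states, n_actions))
# next_state[t, s, a]: deterministic ego forward model on the grid (assumed given)
next_state = rng.integers(0, n_states, size=(T, n_states, n_actions))

# Backward Bellman evaluation with terminal value V_T = 0:
#   Q_t(s, a) = r_t(s, a) + gamma * V_{t+1}(s'),   V_t(s) = max_a Q_t(s, a)
V = np.zeros(n_states)
Q = np.zeros((T, n_states, n_actions))
for t in reversed(range(T)):
    Q[t] = reward[t] + gamma * V[next_state[t]]
    V = Q[t].max(axis=1)
```

The resulting table `Q` holds the action-values that, in the paper's pipeline, would supervise the vision-based policy; because the world is assumed non-reactive, no environment rollout is needed during this evaluation.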