We learn an interactive vision-based driving policy from pre-recorded driving logs via a model-based approach. A forward model of the world supervises a driving policy that predicts the outcome of any potential driving trajectory. To support learning from pre-recorded logs, we assume that the world is on rails, meaning neither the agent nor its actions influence the environment. This assumption greatly simplifies the learning problem, factorizing the dynamics into a nonreactive world model and a low-dimensional and compact forward model of the ego-vehicle. Our approach computes action-values for each training trajectory using a tabular dynamic-programming evaluation of the Bellman equations; these action-values in turn supervise the final vision-based driving policy. Despite the world-on-rails assumption, the final driving policy acts well in a dynamic and reactive world. At the time of writing, our method ranks first on the CARLA leaderboard, attaining a 25% higher driving score while using 40 times less data. Our method is also an order of magnitude more sample-efficient than state-of-the-art model-free reinforcement learning techniques on navigational tasks in the ProcGen benchmark.
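To make the "tabular dynamic-programming evaluation of the Bellman equations" concrete, here is a minimal, hypothetical sketch of a backward Bellman backup along one pre-recorded log. All names (GAMMA, N_STATES, N_ACTIONS, ego_step, reward) are illustrative assumptions, not the authors' implementation or API; the key property it mirrors is that, under the world-on-rails assumption, the per-timestep rewards come from the replayed log and only the low-dimensional ego state responds to actions.

```python
import numpy as np

# Hypothetical sketch: tabular DP evaluation of the Bellman equations
# along a single recorded trajectory. The world is "on rails", so the
# reward at step t depends on the logged world state at t, while only
# the discretized ego state evolves under the agent's actions.

GAMMA = 0.9          # discount factor (assumed value)
N_STATES = 64        # discretized ego states (e.g. speed x heading bins)
N_ACTIONS = 6        # discretized steering/throttle combinations
T = 100              # length of the recorded log

rng = np.random.default_rng(0)
# Deterministic, low-dimensional ego forward model: s' = ego_step[s, a].
ego_step = rng.integers(0, N_STATES, size=(N_STATES, N_ACTIONS))
# Per-timestep reward table r_t(s, a), precomputed offline against the
# replayed world (e.g. route progress minus collision/lane penalties).
reward = rng.standard_normal(size=(T, N_STATES, N_ACTIONS))

# Backward induction: V_T = 0, then one Bellman backup per log timestep.
V = np.zeros(N_STATES)
Q = np.zeros((T, N_STATES, N_ACTIONS))
for t in reversed(range(T)):
    Q[t] = reward[t] + GAMMA * V[ego_step]   # Q_t(s,a) = r_t(s,a) + γ V_{t+1}(s')
    V = Q[t].max(axis=1)                     # V_t(s) = max_a Q_t(s,a)

# Q[t, s_t] now holds action-values at the logged ego states; these are
# the kind of targets that can supervise a vision-based policy, e.g. via
# softmax or argmax distillation on the corresponding camera frames.
```

Because the ego model is compact and the backup is a single vectorized sweep per timestep, this evaluation is cheap relative to learning a full reactive world model, which is the source of the method's sample efficiency.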