An accurate model of the environment and the dynamic agents acting in it offers great potential for improving motion planning. We present MILE: a Model-based Imitation LEarning approach to jointly learn a model of the world and a policy for autonomous driving. Our method leverages 3D geometry as an inductive bias and learns a highly compact latent space directly from high-resolution videos of expert demonstrations. Our model is trained on an offline corpus of urban driving data, without any online interaction with the environment. MILE improves upon the prior state of the art by 31% in driving score on the CARLA simulator when deployed in a completely new town and new weather conditions. Our model can predict diverse and plausible states and actions that can be interpretably decoded to bird's-eye view semantic segmentation. Further, we demonstrate that it can execute complex driving manoeuvres from plans entirely predicted in imagination. Our approach is the first camera-only method that models the static scene, the dynamic scene, and ego-behaviour in an urban driving environment. The code and model weights are available at https://github.com/wayveai/mile.