End-to-end autonomous driving offers a feasible way to automatically maximize overall driving-system performance by mapping raw pixels from a front-facing camera directly to control signals. Recent methods construct a latent world model that maps high-dimensional observations into a compact latent space. However, the latent states produced by the world models in previous works may contain a large amount of task-irrelevant information, resulting in low sample efficiency and poor robustness to input perturbations. Meanwhile, the training data distribution is usually imbalanced, so the learned policy struggles to handle corner cases that arise during driving. To address these challenges, we present a semantic masked recurrent world model (SEM2), which introduces a latent filter that extracts key task-relevant features and reconstructs a semantic mask from the filtered features. SEM2 is trained with a multi-source data sampler, which aggregates common data and multiple kinds of corner-case data in a single batch to balance the data distribution. Extensive experiments on CARLA show that our method outperforms state-of-the-art approaches in sample efficiency and robustness to input perturbations.
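The multi-source data sampler described above can be illustrated with a minimal sketch. The abstract does not give the mixing ratios or the buffer layout, so the equal per-source share, sampling with replacement, and the names `sample_multi_source_batch`, `common`, `corner_a`, and `corner_b` are all assumptions made for illustration, not the authors' implementation.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def sample_multi_source_batch(
    sources: Dict[str, Sequence[T]],
    batch_size: int,
    rng: random.Random,
) -> List[T]:
    """Draw one batch with a near-equal share from every non-empty source.

    `sources` maps a hypothetical source name (e.g. "common", "corner_a")
    to its buffer of stored samples. Equal shares are an assumption of
    this sketch; the abstract does not state the actual ratios.
    """
    names = [name for name, buf in sources.items() if len(buf) > 0]
    if not names:
        raise ValueError("all sources are empty")

    share, remainder = divmod(batch_size, len(names))
    batch: List[T] = []
    for i, name in enumerate(names):
        # The first `remainder` sources receive one extra sample each.
        k = share + (1 if i < remainder else 0)
        # Sample with replacement so that small corner-case buffers
        # can still fill their share of the batch.
        batch.extend(rng.choices(sources[name], k=k))

    # Shuffle so that samples from one source are not contiguous.
    rng.shuffle(batch)
    return batch


# Usage: one large "common" buffer and two small corner-case buffers.
rng = random.Random(0)
sources = {
    "common": [("common", i) for i in range(1000)],
    "corner_a": [("corner_a", i) for i in range(10)],
    "corner_b": [("corner_b", i) for i in range(5)],
}
batch = sample_multi_source_batch(sources, batch_size=9, rng=rng)
```

With three sources and a batch size of 9, each source contributes three samples, so the rare corner cases appear in every batch even though the common buffer is far larger.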