Driving in a complex urban environment is a difficult task that requires a sophisticated decision policy. To make informed decisions, one needs to understand the long-range context of the scene and the relative importance of other vehicles. In this work, we propose using a Vision Transformer (ViT) to learn a driving policy in urban settings from bird's-eye-view (BEV) input images. The ViT network learns the global context of the scene more effectively than previously proposed Convolutional Neural Networks (ConvNets). Furthermore, the ViT's attention mechanism learns an attention map over the scene, which allows the ego car to determine which surrounding cars are important to its next decision. We demonstrate that a DQN agent with a ViT backbone outperforms baseline algorithms with ConvNet backbones pre-trained in various ways. In particular, the proposed method helps reinforcement learning algorithms learn faster, reaching higher performance with less data than the baselines.
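As a rough illustration of the attention-map idea, the sketch below splits a toy BEV occupancy grid into patches and computes single-head scaled dot-product self-attention weights between them. It is not the network described in this work: the grid, the patch size, and the function names `patchify` and `attention_map` are hypothetical, and the query/key projections are assumed to be the identity instead of learned matrices.

```python
import math


def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch vectors.

    Patches are returned in row-major order. Assumes H and W are
    multiples of `patch`.
    """
    h, w = len(image), len(image[0])
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(tokens):
    """Return the N x N matrix of self-attention weights.

    Assumption for illustration: queries and keys are the raw patch
    vectors (identity projections), with a single attention head.
    """
    scale = math.sqrt(len(tokens[0]))
    weights = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in tokens]
        weights.append(softmax(scores))
    return weights


# Toy 4x4 occupancy grid: 1.0 marks a cell covered by a vehicle.
# The top-left patch stands in for the ego car, the bottom-right for another car.
bev = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]

tokens = patchify(bev, 2)        # four patches of four cells each
weights = attention_map(tokens)  # weights[i][j]: how much patch i attends to patch j

for row in weights:
    print([round(w, 3) for w in row])
```

In this toy setup the ego patch puts more weight on the occupied patch than on the empty ones, which is the kind of signal an attention map can expose. In a trained ViT the weights come from learned projections over many heads and layers, so the resulting map can look quite different.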