Generating a detailed near-field perceptual model of the environment is an important and challenging problem for both self-driving vehicles and autonomous mobile robots. A Bird's Eye View (BEV) map, providing a panoptic representation, is a commonly used approach: it offers a simplified 2D representation of the vehicle's surroundings with accurate semantic-level segmentation for many downstream tasks. Current state-of-the-art approaches to generating BEV maps employ a Convolutional Neural Network (CNN) backbone to create feature maps, which are passed through a spatial transformer to project the derived features onto the BEV coordinate frame. In this paper, we evaluate the use of vision transformers (ViT) as the backbone architecture for generating BEV maps. Our network architecture, ViT-BEVSeg, employs standard vision transformers to generate a multi-scale representation of the input image. The resulting representation is then provided as input to a spatial transformer decoder module, which outputs segmentation maps in the BEV grid. We evaluate our approach on the nuScenes dataset, demonstrating a considerable improvement in performance relative to state-of-the-art approaches.
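To make the described pipeline concrete, the following is a minimal PyTorch sketch of a ViT-backbone-to-BEV-segmentation flow: patch tokens are encoded with standard transformer layers, folded back into a 2D feature map, and projected onto a BEV grid by a decoder head. All module names, dimensions, and the simple resample-based "spatial transform" are illustrative assumptions, not the paper's actual ViT-BEVSeg implementation.

```python
import torch
import torch.nn as nn


class ViTBackbone(nn.Module):
    """Patch-embed an image and encode it with standard transformer layers."""

    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.grid = img_size // patch                        # tokens per side
        self.patch_embed = nn.Conv2d(3, dim, patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        feats = self.encoder(tokens + self.pos)
        # Fold the token sequence back into a 2D feature map for the decoder.
        B, N, D = feats.shape
        return feats.transpose(1, 2).reshape(B, D, self.grid, self.grid)


class BEVDecoder(nn.Module):
    """Project image-plane features onto a BEV grid and predict class logits."""

    def __init__(self, dim=256, bev_size=100, num_classes=14):
        super().__init__()
        self.bev_size = bev_size
        # Placeholder "spatial transform": a 1x1 conv plus bilinear resampling
        # to BEV resolution, standing in for a geometry-aware projection module.
        self.project = nn.Conv2d(dim, dim, 1)
        self.head = nn.Conv2d(dim, num_classes, 1)

    def forward(self, feats):
        bev = nn.functional.interpolate(
            self.project(feats), size=(self.bev_size, self.bev_size),
            mode="bilinear", align_corners=False)
        return self.head(bev)                       # (B, classes, H_bev, W_bev)


if __name__ == "__main__":
    model = nn.Sequential(ViTBackbone(), BEVDecoder())
    logits = model(torch.randn(1, 3, 224, 224))
    print(logits.shape)  # torch.Size([1, 14, 100, 100])
```

In the full approach the decoder would also use camera intrinsics/extrinsics to map image-plane features into metric BEV coordinates; the sketch only illustrates the overall tensor flow from ViT tokens to a BEV segmentation grid.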