Recent works in autonomous driving have widely adopted the bird's-eye-view (BEV) semantic map as an intermediate representation of the world. Online prediction of these BEV maps involves non-trivial operations such as multi-camera data extraction as well as fusion and projection into a common top-view grid. This is usually done with error-prone geometric operations (e.g., homography or back-projection from monocular depth estimation) or expensive direct dense mapping between image pixels and pixels in BEV (e.g., with MLP or attention). In this work, we present 'LaRa', an efficient encoder-decoder, transformer-based model for vehicle semantic segmentation from multiple cameras. Our approach uses a system of cross-attention to aggregate information over multiple sensors into a compact, yet rich, collection of latent representations. These latent representations, after being processed by a series of self-attention blocks, are then reprojected with a second cross-attention into the BEV space. We demonstrate that our model outperforms the best previous works using transformers on nuScenes.
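The abstract describes a three-stage pipeline: cross-attention from learned latents to multi-camera features, self-attention over the latents, and a second cross-attention from BEV queries to the latents. Below is a minimal PyTorch sketch of that high-level structure; all names (e.g., `LaRaSketch`), dimensions, and the use of standard `nn.MultiheadAttention` / `nn.TransformerEncoderLayer` modules are illustrative assumptions, not the authors' implementation, and camera feature extraction and geometric (ray) embeddings are omitted.

```python
# Illustrative sketch of the latent cross-attention pipeline described above.
# Hyperparameters and module choices are assumptions, not the paper's exact model.
import torch
import torch.nn as nn


class LaRaSketch(nn.Module):
    def __init__(self, feat_dim=256, latent_dim=256, num_latents=512,
                 num_self_blocks=4, num_heads=8, bev_size=200, num_classes=1):
        super().__init__()
        # Learned latent vectors that aggregate information from all cameras.
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        # First cross-attention: latents (queries) attend to image features (keys/values).
        self.encode_xattn = nn.MultiheadAttention(
            latent_dim, num_heads, kdim=feat_dim, vdim=feat_dim, batch_first=True)
        # Series of self-attention blocks processing the latent collection.
        self.self_blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(latent_dim, num_heads, batch_first=True)
            for _ in range(num_self_blocks)
        ])
        # Learned BEV query grid and a second cross-attention that "reprojects"
        # the latents into the BEV space.
        self.bev_queries = nn.Parameter(torch.randn(bev_size * bev_size, latent_dim))
        self.decode_xattn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.head = nn.Linear(latent_dim, num_classes)
        self.bev_size = bev_size

    def forward(self, cam_feats):
        # cam_feats: (B, n_cams * H * W, feat_dim), flattened multi-camera features.
        B = cam_feats.shape[0]
        latents = self.latents.unsqueeze(0).expand(B, -1, -1)
        latents, _ = self.encode_xattn(latents, cam_feats, cam_feats)
        for block in self.self_blocks:
            latents = block(latents)
        bev_q = self.bev_queries.unsqueeze(0).expand(B, -1, -1)
        bev, _ = self.decode_xattn(bev_q, latents, latents)
        logits = self.head(bev)  # (B, bev_size * bev_size, num_classes)
        return logits.transpose(1, 2).reshape(B, -1, self.bev_size, self.bev_size)
```

Because the number of latents is fixed and much smaller than the number of image pixels or BEV cells, both cross-attentions stay far cheaper than a dense pixel-to-BEV mapping, which is the efficiency argument made in the abstract.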