Learning Bird's Eye View (BEV) representations from surrounding-view cameras is of great importance for autonomous driving. In this work, we propose the Geometry-guided Kernel Transformer (GKT), a novel 2D-to-BEV representation learning mechanism. GKT leverages geometric priors to guide the transformer to focus on discriminative regions and unfolds kernel features to generate the BEV representation. For fast inference, we further introduce a look-up table (LUT) indexing method that eliminates dependence on camera calibration parameters at runtime. GKT runs at $72.3$ FPS on a 3090 GPU and $45.6$ FPS on a 2080Ti GPU, and is robust to camera deviation and to the choice of predefined BEV height. GKT achieves state-of-the-art real-time segmentation results, i.e., 38.0 mIoU (100m$\times$100m perception range at a 0.5m resolution) on the nuScenes val set. Given its efficiency, effectiveness, and robustness, GKT has great practical value in autopilot scenarios, especially for real-time systems. Code and models will be available at \url{https://github.com/hustvl/GKT}.
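The LUT indexing idea can be sketched as follows. This is a minimal illustrative sketch (not the paper's implementation): it assumes a single pinhole camera with intrinsics `K` and a world-to-camera extrinsic `Rt`, projects each BEV cell center at a fixed height prior into the image offline, and stores flat pixel indices so that runtime BEV feature generation reduces to a pure gather with no projection math. All function and parameter names here are hypothetical.

```python
import numpy as np

def build_lut(bev_size, bev_res, height, K, Rt, feat_shape):
    """Offline, once per camera rig: project each BEV cell center
    (at a fixed height prior) into the image and store integer
    pixel indices plus a visibility mask."""
    H, W = bev_size
    xs = (np.arange(W) - W / 2 + 0.5) * bev_res
    ys = (np.arange(H) - H / 2 + 0.5) * bev_res
    gx, gy = np.meshgrid(xs, ys)
    # Homogeneous 3D points on the BEV plane: (H, W, 4)
    pts = np.stack([gx, gy, np.full_like(gx, height), np.ones_like(gx)], axis=-1)
    cam = pts.reshape(-1, 4) @ Rt.T          # world -> camera, Rt is (3, 4)
    uvw = cam @ K.T                          # camera -> image plane
    valid = uvw[:, 2] > 1e-3                 # points in front of the camera
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-3, None)
    u = np.clip(np.round(uv[:, 0]).astype(np.int64), 0, feat_shape[1] - 1)
    v = np.clip(np.round(uv[:, 1]).astype(np.int64), 0, feat_shape[0] - 1)
    return v * feat_shape[1] + u, valid      # flat indices into the feature map

def lookup(feat, lut, valid):
    """Runtime: gather image features into BEV with one indexing op;
    no calibration parameters are needed here."""
    flat = feat.reshape(-1, feat.shape[-1])  # (H_img * W_img, C)
    return flat[lut] * valid[:, None]        # zero out invisible cells
```

In the actual method the table would cover all surrounding cameras and the kernel's neighboring pixels per BEV cell, but the runtime cost structure is the same: calibration is folded into the precomputed indices, so inference is calibration-free.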