A self-driving perception model aims to extract 3D semantic representations from multiple cameras collectively into the bird's-eye-view (BEV) coordinate frame of the ego car in order to ground the downstream planner. Existing perception methods often rely on error-prone depth estimation of the whole scene, or on learning sparse virtual 3D representations without the target geometry structure, both of which remain limited in performance and/or capability. In this paper, we present a novel end-to-end architecture for ego 3D representation learning from an arbitrary number of unconstrained camera views. Inspired by the ray tracing principle, we design a polarized grid of "imaginary eyes" as the learnable ego 3D representation, and formulate the learning process with an adaptive attention mechanism in conjunction with 3D-to-2D projection. Critically, this formulation allows extracting rich 3D representations from 2D images without any depth supervision, and with a built-in geometry structure consistent with the BEV. Despite its simplicity and versatility, extensive experiments on standard BEV visual tasks (e.g., camera-based 3D object detection and BEV segmentation) show that our model significantly outperforms all state-of-the-art alternatives, with an extra advantage in computational efficiency from multi-task learning.
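To make the core idea concrete, the sketch below illustrates one plausible reading of the described pipeline: a polar ("polarized") grid of learnable BEV queries (the "imaginary eyes") whose 3D positions are projected into camera views, after which each query attends over the image features gathered at its projections, with no depth supervision involved. This is a minimal illustration, not the authors' implementation; the grid sizes and helper names such as `polar_bev_grid`, `project_points`, and `EgoQueryAttention` are hypothetical.

```python
# Minimal sketch (assumed PyTorch implementation, not the paper's code) of a
# polar grid of learnable ego queries combined with 3D-to-2D projection and
# attention-based feature gathering.
import torch
import torch.nn as nn


def polar_bev_grid(num_rings=40, num_rays=80, max_radius=50.0, height=0.0):
    """Build a polar grid of 3D points (x, y, z) around the ego car."""
    radii = torch.linspace(1.0, max_radius, num_rings)            # ring radii in metres
    angles = torch.linspace(0, 2 * torch.pi, num_rays + 1)[:-1]   # ray angles
    r, a = torch.meshgrid(radii, angles, indexing="ij")
    xyz = torch.stack([r * torch.cos(a), r * torch.sin(a),
                       torch.full_like(r, height)], dim=-1)       # (R, A, 3)
    return xyz.reshape(-1, 3)


def project_points(xyz, intrinsics, extrinsics):
    """Project 3D ego-frame points into one camera (4x4 extrinsic, 3x3 intrinsic)."""
    ones = torch.ones_like(xyz[:, :1])
    cam = (extrinsics @ torch.cat([xyz, ones], dim=-1).T).T[:, :3]  # ego -> camera frame
    valid = cam[:, 2] > 0.1                                         # keep points in front of camera
    uv = (intrinsics @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-5)                     # perspective divide
    return uv, valid


class EgoQueryAttention(nn.Module):
    """Each 'imaginary eye' attends to image features sampled at its 2D projections."""

    def __init__(self, dim=256, num_heads=8, num_queries=40 * 80):
        super().__init__()
        self.query = nn.Parameter(torch.randn(num_queries, dim))   # learnable polar queries
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, sampled_feats):
        # sampled_feats: (B, Q, K, dim) image features gathered at the projected
        # locations of each query across cameras/scales (K candidates per query).
        B, Q, K, D = sampled_feats.shape
        q = self.query.unsqueeze(0).expand(B, -1, -1).reshape(B * Q, 1, D)
        kv = sampled_feats.reshape(B * Q, K, D)
        out, _ = self.attn(q, kv, kv)           # adaptive attention over the candidates
        return out.reshape(B, Q, D)             # refined ego 3D representation
```

In this reading, the refined per-query features form a BEV-aligned representation that downstream 3D detection or BEV segmentation heads could consume directly, which is consistent with the multi-task efficiency claim above.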