3D object detection in autonomous driving aims to reason about "what" and "where" the objects of interest are in a 3D world. Following the conventional wisdom of prior 2D object detection, existing methods often adopt the canonical Cartesian coordinate system with perpendicular axes. However, we conjecture that this does not fit the nature of the ego car's perspective, as each onboard camera perceives the world in the shape of a wedge intrinsic to the imaging geometry, with radial (non-perpendicular) axes. Hence, in this paper we advocate the exploitation of the Polar coordinate system and propose a new Polar Transformer (PolarFormer) for more accurate 3D object detection in the bird's-eye view (BEV), taking only multi-camera 2D images as input. Specifically, we design a cross-attention based Polar detection head that imposes no restriction on the shape of the input structure, so it can deal with irregular Polar grids. To tackle the unconstrained object scale variations along the Polar distance dimension, we further introduce a multi-scale Polar representation learning strategy. As a result, our model can make the best use of the Polar representation, rasterized by attending to the corresponding image observations in a sequence-to-sequence fashion subject to the geometric constraints. Thorough experiments on the nuScenes dataset demonstrate that our PolarFormer significantly outperforms state-of-the-art 3D object detection alternatives.
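To make the geometric intuition concrete, here is a minimal NumPy sketch (not from the paper; the grid resolution, range, and field of view are illustrative assumptions) that builds a wedge-shaped Polar BEV grid and converts its cell centres to Cartesian ego-frame coordinates, showing how cells grow with distance rather than staying uniform as in a Cartesian raster.

```python
import numpy as np

def polar_bev_grid(num_radial=64, num_azimuth=128,
                   r_max=50.0, fov=np.deg2rad(120.0)):
    """Return a (num_radial, num_azimuth, 2) array of (x, y) Polar cell centres.

    All parameters are assumed example values, not the paper's settings.
    """
    radii = np.linspace(0.0, r_max, num_radial + 1)
    azimuths = np.linspace(-fov / 2, fov / 2, num_azimuth + 1)
    r_centres = 0.5 * (radii[:-1] + radii[1:])        # radial cell centres
    a_centres = 0.5 * (azimuths[:-1] + azimuths[1:])  # azimuth cell centres
    r, a = np.meshgrid(r_centres, a_centres, indexing="ij")
    # Polar -> Cartesian (x forward, y lateral in the ego frame): cells near
    # the ego car are small, far cells are large, matching the wedge-shaped
    # imaging geometry of each onboard camera.
    return np.stack([r * np.cos(a), r * np.sin(a)], axis=-1)

grid = polar_bev_grid()
print(grid.shape)                  # (64, 128, 2)
print(grid[0, 64], grid[-1, 64])   # near vs. far cell centres near the optical axis
```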