3D object detection is vital for many robotics applications. For tasks where a 2D perspective range image exists, we propose to learn a 3D representation directly from this range image view. To this end, we designed a 2D convolutional network architecture that carries the 3D spherical coordinates of each pixel throughout the network. Its layers can consume an arbitrary convolution kernel in place of the default inner product kernel and exploit the underlying local geometry around each pixel. We outline four such kernels: a dense kernel following the bag-of-words paradigm, and three graph kernels inspired by recent graph neural network advances: the Transformer, the PointNet, and the Edge Convolution. We also explore cross-modality fusion with the camera image, facilitated by operating in the perspective range image view. Our method performs competitively on the Waymo Open Dataset and improves the state-of-the-art AP for pedestrian detection from 69.7% to 75.5%. It is also efficient in that our smallest model, which still outperforms the popular PointPillars in quality, requires 180 times fewer FLOPs and model parameters.
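To make the kernel-swapping idea concrete, below is a minimal sketch (in PyTorch) of one of the four variants: an Edge Convolution-style kernel applied over a range image. Each pixel carries its 3D coordinates through the network, and the layer aggregates edge features over a k x k pixel neighborhood instead of computing the standard inner product. The class name `EdgeConvRangeKernel`, the shapes, and the exact feature parameterization are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: EdgeConv-style graph kernel over a range image neighborhood.
# Assumes per-pixel 3D (x, y, z) coordinates decoded from the spherical
# range image are available alongside the feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EdgeConvRangeKernel(nn.Module):
    """For each pixel i and each neighbor j in a k x k window, embed the
    edge feature [f_j - f_i, p_j - p_i] with a small MLP, then max-pool
    over the window (as in EdgeConv / DGCNN)."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.k = k
        # Edge MLP: input is relative feature (in_ch) + relative xyz (3).
        self.mlp = nn.Sequential(
            nn.Conv2d(in_ch + 3, out_ch, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, feats: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) per-pixel features
        # xyz:   (B, 3, H, W) per-pixel 3D coordinates carried through the net
        B, C, H, W = feats.shape
        k, pad = self.k, self.k // 2
        # Gather the k*k neighborhood of every pixel for features and coords.
        nbr_f = F.unfold(feats, k, padding=pad).view(B, C, k * k, H, W)
        nbr_p = F.unfold(xyz, k, padding=pad).view(B, 3, k * k, H, W)
        # Relative (edge) features w.r.t. the center pixel.
        rel_f = nbr_f - feats.unsqueeze(2)
        rel_p = nbr_p - xyz.unsqueeze(2)
        edges = torch.cat([rel_f, rel_p], dim=1)      # (B, C+3, k*k, H, W)
        edges = edges.view(B, C + 3, k * k * H, W)    # fold for 1x1 conv
        out = self.mlp(edges).view(B, -1, k * k, H, W)
        # Max over the neighborhood, the EdgeConv aggregation.
        return out.max(dim=2).values                  # (B, out_ch, H, W)
```

Because the neighborhood is defined by the range image's pixel grid while the edge features use true 3D offsets, the layer stays a cheap dense 2D operation yet reasons about local 3D geometry; the PointNet and Transformer kernels mentioned in the abstract would differ only in how the gathered neighborhood is aggregated.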