While 2D object detection has improved significantly over the past years, real-world applications of computer vision often require an understanding of the 3D layout of a scene. Many recent approaches to 3D detection use LiDAR point clouds for prediction. We propose a method that uses only a single RGB image, thus enabling applications in devices or vehicles that do not have LiDAR sensors. By using an RGB image, we can leverage the maturity and success of recent 2D object detectors and extend a 2D detector with a 3D detection head. In this paper we discuss different approaches and experiments, including both regression and classification methods, for designing this 3D detection head. Furthermore, we evaluate how subproblems and implementation details impact the overall prediction result. We use the KITTI dataset for training, which consists of street traffic scenes with class labels, 2D bounding boxes and 3D annotations with seven degrees of freedom. Our final architecture is based on Faster R-CNN. The outputs of the convolutional backbone are fixed-size feature maps for every region of interest. Fully connected layers within the network head then propose an object class and perform 2D bounding box regression. We extend the network head with a 3D detection head that predicts every degree of freedom of a 3D bounding box via classification. We achieve a mean average precision of 47.3% on the moderate difficulty level, measured at a 3D intersection over union threshold of 70% as required by the official KITTI benchmark, outperforming previous state-of-the-art single-RGB-only methods by a large margin.
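To make the described extension concrete, the following is a minimal sketch, not the authors' implementation, of a 3D detection head that attaches to the per-RoI feature vectors of a Faster R-CNN head and predicts each of the seven degrees of freedom (3D center, dimensions, yaw) by classification over discretized bins. The class name, layer sizes, and bin count are hypothetical choices for illustration.

import torch
import torch.nn as nn


class Box3DClassificationHead(nn.Module):
    """Hypothetical 3D head: one bin classifier per degree of freedom."""

    def __init__(self, in_features: int = 1024, bins_per_dof: int = 32, num_dof: int = 7):
        super().__init__()
        # Shared hidden layer on top of the pooled RoI feature vector.
        self.shared = nn.Sequential(
            nn.Linear(in_features, 512),
            nn.ReLU(inplace=True),
        )
        # One classifier per degree of freedom, each over its own set of bins.
        self.classifiers = nn.ModuleList(
            [nn.Linear(512, bins_per_dof) for _ in range(num_dof)]
        )

    def forward(self, roi_features: torch.Tensor) -> torch.Tensor:
        # roi_features: (num_rois, in_features), one row per region of interest.
        h = self.shared(roi_features)
        # Returns logits of shape (num_rois, num_dof, bins_per_dof); at inference
        # each DoF would be decoded to a continuous value, e.g. the center of the
        # arg-max bin.
        return torch.stack([clf(h) for clf in self.classifiers], dim=1)


if __name__ == "__main__":
    head = Box3DClassificationHead()
    feats = torch.randn(8, 1024)   # 8 RoIs with 1024-d pooled features
    logits = head(feats)           # shape: (8, 7, 32)
    print(logits.shape)

Under these assumptions, training would apply a cross-entropy loss per degree of freedom against the ground-truth bin index; the paper's actual discretization and loss weighting may differ.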