Monocular 3D object detection aims to predict the 3D bounding boxes of objects from monocular RGB images. Because recovering location in 3D space is difficult without depth information, this paper proposes a unified framework that decomposes the detection problem into a structured polygon prediction task and a depth recovery task. Unlike the widely studied 2D bounding box, the proposed structured polygon in the 2D image consists of several projected surfaces of the target object. Compared to the widely used 3D bounding box proposals, it is shown to be a better representation for 3D detection. To inversely project the predicted 2D structured polygon to a cuboid in the 3D physical world, the subsequent depth recovery task uses an object height prior to complete the inverse projection transformation with the given camera projection matrix. Moreover, a fine-grained 3D box refinement scheme is proposed to further rectify the 3D detection results. Experiments on the challenging KITTI benchmark show that our method achieves state-of-the-art detection accuracy.
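The role of the object height prior in the inverse projection can be illustrated with a minimal sketch. It assumes an ideal pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, a camera frame whose Y axis points downward, and a vertical object edge whose two endpoints share the same depth. The function names and parameters are hypothetical and chosen only for illustration; this is not the formulation used in the paper, which works with the full camera projection matrix.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]


def depth_from_height(v_top: float, v_bottom: float,
                      height_m: float, fy: float) -> float:
    """Depth of a vertical edge from its pixel height and a known metric height.

    Under a pinhole model, v = fy * Y / Z + cy, so for two points at the same
    depth Z that are height_m apart along Y:
        v_bottom - v_top = fy * height_m / Z
    """
    pixel_height = v_bottom - v_top
    if pixel_height <= 0:
        raise ValueError("v_bottom must be greater than v_top")
    return fy * height_m / pixel_height


def back_project(u: float, v: float, depth: float,
                 fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Inverse pinhole projection of pixel (u, v) at a known depth."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def recover_vertical_edge(u: float, v_top: float, v_bottom: float,
                          height_m: float, fx: float, fy: float,
                          cx: float, cy: float) -> Tuple[Point3D, Point3D]:
    """Recover the 3D endpoints of one vertical edge of the polygon."""
    z = depth_from_height(v_top, v_bottom, height_m, fy)
    top = back_project(u, v_top, z, fx, fy, cx, cy)
    bottom = back_project(u, v_bottom, z, fx, fy, cx, cy)
    return top, bottom


if __name__ == "__main__":
    # Illustrative values only: a 1.5 m tall edge seen by a camera with
    # fx = fy = 700 px and principal point (600, 180).
    top, bottom = recover_vertical_edge(
        u=670.0, v_top=185.25, v_bottom=237.75, height_m=1.5,
        fx=700.0, fy=700.0, cx=600.0, cy=180.0)
    print(top, bottom)
```

Applying the same step to each vertical edge of a predicted polygon would yield the corners of a cuboid, under the stated assumptions. The sketch ignores the translation terms that appear in calibrated projection matrices such as those distributed with KITTI.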