Monocular image-based 3D perception has become an active research area in recent years owing to its applications in autonomous driving. Approaches to monocular 3D perception, including detection and tracking, however, often yield inferior performance compared to LiDAR-based techniques. Through systematic analysis, we identify that per-object depth estimation accuracy is a major factor bounding performance. Motivated by this observation, we propose a multi-level fusion method that combines different representations (RGB and pseudo-LiDAR) and temporal information across multiple frames of object tracklets to enhance per-object depth estimation. Our proposed fusion method achieves state-of-the-art per-object depth estimation performance on the Waymo Open Dataset, the KITTI detection dataset, and the KITTI MOT dataset. We further demonstrate that simply replacing the estimated depth with our fusion-enhanced depth yields significant improvements in monocular 3D perception tasks, including detection and tracking.
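To make the fusion idea in the abstract concrete, below is a minimal sketch (not the authors' implementation) of how per-object RGB features and pseudo-LiDAR point features might be fused and then aggregated over the frames of a tracklet to regress a single per-object depth. All module names, feature dimensions, input sizes, and the attention-style temporal pooling are illustrative assumptions.

```python
# Hypothetical sketch of multi-level (representation + temporal) fusion for
# per-object depth estimation; not the paper's actual architecture.
import torch
import torch.nn as nn


class PerObjectDepthFusion(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Encoder for the RGB crop of one object (assumed 64x64 patch).
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # PointNet-style encoder for the object's pseudo-LiDAR points (N x 3).
        self.point_encoder = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, feat_dim),
        )
        # Representation-level fusion of the two per-frame features.
        self.fuse = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        # Temporal aggregation over the tracklet via learned attention weights.
        self.temporal_score = nn.Linear(feat_dim, 1)
        # Final per-object depth regressor.
        self.depth_head = nn.Linear(feat_dim, 1)

    def forward(self, rgb_crops, pseudo_points):
        # rgb_crops:     (T, 3, 64, 64)  RGB patches of one tracklet over T frames
        # pseudo_points: (T, N, 3)       pseudo-LiDAR points per frame
        rgb_feat = self.rgb_encoder(rgb_crops)                  # (T, D)
        pts_feat = self.point_encoder(pseudo_points).max(1)[0]  # (T, D), max-pool over points
        per_frame = self.fuse(torch.cat([rgb_feat, pts_feat], dim=-1))  # (T, D)
        weights = torch.softmax(self.temporal_score(per_frame), dim=0)  # (T, 1)
        tracklet_feat = (weights * per_frame).sum(dim=0)                # (D,)
        return self.depth_head(tracklet_feat)                           # (1,) depth estimate


if __name__ == "__main__":
    model = PerObjectDepthFusion()
    depth = model(torch.randn(5, 3, 64, 64), torch.randn(5, 256, 3))
    print(depth.shape)  # torch.Size([1])
```

The key design point this sketch illustrates is that fusion happens at two levels: across representations (RGB vs. pseudo-LiDAR) within each frame, and across time over the tracklet, so that a single refined depth can replace the per-frame estimate downstream.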