Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects at different depths can appear with the same bounding boxes and similar visual features in the 2D image. Unfortunately, the network cannot accurately distinguish different depths from such non-discriminative visual features, resulting in unstable depth training. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To keep these 3D pseudo labels reasonable, we carefully design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on the KITTI and Waymo benchmarks show that our method improves state-of-the-art monocular 3D detectors by a significant margin (the improvements under the moderate setting on the KITTI validation set are $\mathbf{1.82\sim 10.91\%}$ mAP in BEV and $\mathbf{1.18\sim 9.36\%}$ mAP in 3D). Code has been released at https://github.com/mrsempress/OBMO.
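To make the pseudo-label idea concrete, here is a minimal sketch of how a 3D box center could be shifted along the viewing frustum so that its depth changes while its 2D projection (and hence its 2D bounding box) stays essentially the same. The function name, argument layout, and depth offsets below are illustrative assumptions, not the released implementation.

```python
import numpy as np

def generate_pseudo_centers(center, depth_offsets):
    """Hypothetical OBMO-style pseudo-label sketch.

    Scaling the camera-space center by (z + dz) / z keeps the point on the
    same viewing ray from the camera origin, so each shifted center projects
    to (almost) the same pixel location while its depth becomes z + dz.

    Args:
        center: (3,) array-like, box center (x, y, z) in camera coordinates.
        depth_offsets: iterable of depth shifts in meters.

    Returns:
        List of (3,) arrays, one shifted center per offset.
    """
    center = np.asarray(center, dtype=float)
    z = center[2]
    return [center * ((z + dz) / z) for dz in depth_offsets]

# Example: the original label plus pseudo labels shifted by -1 m and +1 m in depth.
print(generate_pseudo_centers([2.0, 1.0, 20.0], [-1.0, 0.0, 1.0]))
```

In the paper, each such shifted label is additionally assigned a quality score (via the two label scoring strategies) so that the network learns a soft, reasonable depth range rather than a single hard depth value.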