Monocular 3D object detection is a fundamental and important task for many applications, including autonomous driving, robotic grasping, and augmented reality. Existing leading methods tend to first estimate the depth of the input image and then detect 3D objects from the resulting point cloud. This pipeline suffers from the inherent gap between depth estimation and object detection, and the accumulation of prediction errors further degrades performance. In this paper, we propose a novel method named MonoSIM. The insight behind MonoSIM is to have the monocular detector simulate the feature learning behavior of a point-cloud-based detector during training, so that during inference its learned features and predictions are as similar as possible to those of the point-cloud-based detector. To achieve this, we propose a scene-level simulation module, an RoI-level simulation module, and a response-level simulation module, which are applied progressively across the detector's full feature learning and prediction pipeline. We apply our method to the well-known M3D-RPN and CaDDN detectors and conduct extensive experiments on the KITTI and Waymo Open datasets. Results show that our method consistently improves the performance of different monocular detectors by a large margin without changing their network architectures. Our code will be publicly available at https://github.com/sunh18/MonoSIM.
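As a rough illustration of the three simulation levels, the sketch below combines a scene-level, an RoI-level, and a response-level term into one training loss. It is not the MonoSIM implementation. The abstract does not specify the loss form, so the use of mean squared error, the integer box cropping, the equal default weights, and all names (`mse`, `crop_rois`, `simulation_loss`, the `feat` and `pred` keys) are assumptions made for illustration only.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two equally shaped arrays."""
    return float(np.mean((a - b) ** 2))


def crop_rois(feat, rois):
    """Crop a (C, H, W) feature map with integer boxes (x1, y1, x2, y2).

    Box ends are exclusive. Real detectors typically use RoI pooling or
    RoI align; plain slicing is a simplification assumed here.
    """
    return [feat[:, y1:y2, x1:x2] for (x1, y1, x2, y2) in rois]


def simulation_loss(mono, lidar, rois, weights=(1.0, 1.0, 1.0)):
    """Hypothetical three-level simulation loss.

    mono / lidar: dicts holding 'feat' (C, H, W) scene features and
    'pred' (N, D) detector outputs from the monocular detector and the
    point-cloud-based detector, respectively.
    """
    # Scene level: match the full feature maps.
    scene = mse(mono["feat"], lidar["feat"])

    # RoI level: match features inside each object region.
    pairs = zip(crop_rois(mono["feat"], rois), crop_rois(lidar["feat"], rois))
    roi_terms = [mse(m, l) for m, l in pairs]
    roi = float(np.mean(roi_terms)) if roi_terms else 0.0

    # Response level: match the final predictions.
    resp = mse(mono["pred"], lidar["pred"])

    w_scene, w_roi, w_resp = weights
    return w_scene * scene + w_roi * roi + w_resp * resp
```

In such a setup the point-cloud-based detector would only be needed during training to supply the target features and predictions; at inference the monocular detector runs alone, which is consistent with the claim that the network architecture is unchanged.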