Monocular 3D object detection has recently shown promising results; however, challenging problems remain. One of these is the lack of invariance to different camera intrinsic parameters, which can be observed across different 3D object datasets. Little effort has been made to exploit the combination of heterogeneous 3D object datasets. Contrary to general intuition, we show that more data does not automatically guarantee better performance; rather, methods need a degree of 'camera independence' to benefit from large and heterogeneous training data. In this paper we propose a category-level pose estimation method based on instance segmentation, using camera-independent geometric reasoning to cope with the varying camera viewpoints and intrinsics of different datasets. Every pixel of an instance predicts the object dimensions, the 3D object reference points projected into 2D image space, and, optionally, the local viewing angle. Camera intrinsics are used only outside the learned network, to lift the predicted 2D reference points to 3D. We surpass camera-independent methods on the challenging KITTI3D benchmark and show key benefits compared to camera-dependent methods.
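To make the final step concrete, the sketch below illustrates the kind of camera-aware lifting the abstract describes: predicted 2D reference points are back-projected to 3D camera coordinates with the pinhole model, keeping the intrinsics outside any learned component. The function name, the depth inputs, and the intrinsics values are illustrative assumptions, not the paper's implementation (the method recovers depth from the predicted object dimensions and reference-point geometry).

```python
import numpy as np

def lift_to_3d(points_2d, depths, K):
    """Back-project 2D pixel points to 3D camera coordinates.

    points_2d : (N, 2) array of predicted reference points in pixels.
    depths    : (N,) array of depths along the optical axis (hypothetical
                inputs here; in practice depth would be recovered from the
                predicted object dimensions).
    K         : (3, 3) camera intrinsic matrix.
    """
    # Homogeneous pixel coordinates: (u, v, 1).
    ones = np.ones((points_2d.shape[0], 1))
    pts_h = np.hstack([points_2d, ones])           # (N, 3)
    # Inverse pinhole projection: X = z * K^{-1} [u, v, 1]^T.
    rays = (np.linalg.inv(K) @ pts_h.T).T          # (N, 3), unit-depth rays
    return rays * depths[:, None]                  # scale each ray by its depth

# Example with KITTI-like intrinsics and two predicted reference points.
K = np.array([[721.5,   0.0, 609.6],
              [  0.0, 721.5, 172.9],
              [  0.0,   0.0,   1.0]])
pts = np.array([[650.0, 180.0], [700.0, 200.0]])
depth = np.array([15.0, 15.4])
print(lift_to_3d(pts, depth, K))
```

Because the network never consumes K, the same predictions remain valid under a change of camera: only this lifting step needs the new intrinsics, which is the source of the camera independence the abstract argues for.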