Current self-supervised monocular depth estimation methods are mostly based on estimating a rigid-body motion that represents the camera motion. Their predictions suffer from the well-known scale ambiguity problem. We propose DepthP+P, a method that learns to estimate outputs in metric scale by following the traditional planar parallax paradigm. We first align the two frames using a common ground plane, which removes the effect of the rotation component of the camera motion. With two neural networks, we then predict the depth and the camera translation; translation is easier to predict on its own than together with rotation. By assuming a known camera height, we can calculate the induced 2D image motion of a 3D point and use it to reconstruct the target image in a self-supervised monocular setting. We perform experiments on the KITTI driving dataset and show that the planar parallax approach, which only needs to predict camera translation, can be a metrically accurate alternative to current methods that rely on estimating 6DoF camera motion.
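To make the geometry concrete, the following is a minimal NumPy sketch of the residual image motion that remains after plane alignment. It is an illustration under stated assumptions, not the DepthP+P implementation. The assumptions are a pinhole camera with intrinsics K, a source pose written as X_src = R X_tgt + t, a ground plane n^T X = d in the target frame (with d the camera height), and a y-axis that points down. Under these conventions, the plane-aligned position of a point X is the projection of X + (height / d_src) c, where c = R^T t, height = d - n^T X, and d_src = d + n^T c, so the rotation R no longer appears on its own. All function names and numeric values below are hypothetical.

```python
import numpy as np

def backproject(pixels, depth, K):
    """Lift Nx2 pixel coordinates with depths (N,) to Nx3 camera-frame points."""
    ones = np.ones((pixels.shape[0], 1))
    rays = np.linalg.inv(K) @ np.hstack([pixels, ones]).T  # 3xN rays with z = 1
    return (rays * depth).T

def project(points, K):
    """Project Nx3 camera-frame points to Nx2 pixel coordinates."""
    uvw = (K @ points.T).T
    return uvw[:, :2] / uvw[:, 2:3]

def residual_parallax(pixels, depth, K, c, n, d):
    """Residual 2D motion of target pixels after ground-plane alignment.

    pixels: Nx2 target-frame pixels; depth: (N,) metric depths.
    c: camera translation expressed in the target frame (assumed c = R^T t).
    n, d: unit plane normal and camera-to-plane distance (the camera height).
    """
    X = backproject(pixels, depth, K)
    height = d - X @ n   # height of each 3D point above the plane
    d_src = d + n @ c    # distance of the source camera from the plane
    X_shifted = X + (height / d_src)[:, None] * c
    return project(X_shifted, K) - pixels

if __name__ == "__main__":
    # Illustrative values only: made-up intrinsics, a 1.65 m camera height,
    # and roughly 1 m of forward motion between the two frames.
    K = np.array([[720.0, 0.0, 620.0], [0.0, 720.0, 187.0], [0.0, 0.0, 1.0]])
    n = np.array([0.0, 1.0, 0.0])    # ground-plane normal, y-axis pointing down
    c = np.array([0.0, 0.0, -1.0])   # translation in the target frame
    pixels = np.array([[300.0, 200.0], [900.0, 300.0]])
    depth = np.array([10.0, 8.0])
    print(residual_parallax(pixels, depth, K, c, n, 1.65))
```

Points that lie on the ground plane have zero height and therefore zero residual motion, while the residual of every other point depends only on its depth, its height above the plane, and the translation. In a self-supervised pipeline of the kind the abstract describes, such a residual field could be used to sample the plane-aligned source image and reconstruct the target image.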