Perceiving 3D information is of paramount importance in many applications of computer vision. Recent advances in monocular depth estimation have shown that such knowledge can be gained from a single camera input by training deep neural networks to predict inverse depth and pose, without requiring ground-truth data. The majority of such approaches, however, require camera parameters to be supplied explicitly during training. As a result, image sequences from the wild cannot be used during training. While there exist methods that also predict camera intrinsics, their performance is not on par with recent methods that take camera parameters as input. In this work, we propose a method for implicit estimation of pinhole camera intrinsics along with depth and pose, by learning from monocular image sequences alone. In addition, by utilizing efficient sub-pixel convolutions, we show that high-fidelity depth estimates can be obtained. We also embed pixel-wise uncertainty estimation into the framework, to emphasize the possible applicability of this work in practical domains. Finally, we demonstrate that depth information can be predicted accurately without prior knowledge of camera intrinsics, while outperforming the existing state-of-the-art approaches on the KITTI benchmark.
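To make the role of the pinhole intrinsics concrete, the following is a minimal sketch of the geometry that self-supervised depth methods commonly rely on: a pixel is back-projected with its depth, moved by the relative camera pose, and re-projected into the other view. It is an illustration under stated assumptions, not the method proposed here. The function names (`intrinsics_matrix`, `backproject`, `project`, `warp_pixel`) and the numeric values in the usage example are hypothetical.

```python
import numpy as np

def intrinsics_matrix(fx, fy, cx, cy):
    """Build a 3x3 pinhole intrinsics matrix (no skew, an assumption)."""
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])

def backproject(u, v, depth, K):
    """Lift pixel (u, v) with the given depth to a 3D point in the camera frame."""
    pixel = np.array([u, v, 1.0])
    return depth * (np.linalg.inv(K) @ pixel)

def project(point, K):
    """Project a 3D camera-frame point to pixel coordinates."""
    uvw = K @ point
    return uvw[:2] / uvw[2]

def warp_pixel(u, v, depth, K, R, t):
    """Map a pixel from a source view into a target view.

    R (3x3 rotation) and t (3-vector translation) describe the
    source-to-target camera motion; both views share the same K.
    """
    point_src = backproject(u, v, depth, K)
    point_tgt = R @ point_src + t
    return project(point_tgt, K)

# Usage with made-up values: identity pose leaves the pixel unchanged.
K = intrinsics_matrix(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
same = warp_pixel(100.0, 50.0, 5.0, K, np.eye(3), np.zeros(3))
shifted = warp_pixel(100.0, 50.0, 5.0, K, np.eye(3), np.array([1.0, 0.0, 0.0]))
```

Because every step depends on `K`, an error in the intrinsics shifts the warped pixel locations. This is why methods of this kind either need the camera parameters as input or must estimate them alongside depth and pose.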