Self-supervised depth learning from monocular images normally relies on the 2D pixel-wise photometric relation between temporally adjacent image frames. However, such methods neither fully exploit 3D point-wise geometric correspondences nor effectively tackle the ambiguities in photometric warping caused by occlusions or illumination inconsistency. To address these problems, this work proposes the Density Volume Construction Network (DevNet), a novel self-supervised monocular depth learning framework that considers 3D spatial information and exploits stronger geometric constraints among adjacent camera frustums. Instead of directly regressing a per-pixel depth value from a single image, our DevNet divides the camera frustum into multiple parallel planes and predicts the point-wise occlusion probability density on each plane. The final depth map is generated by integrating the density along the corresponding rays. During training, novel regularization strategies and loss functions are introduced to mitigate photometric ambiguities and overfitting. Without noticeably enlarging the model parameter size or running time, DevNet outperforms several representative baselines on both the KITTI-2015 outdoor dataset and the NYU-V2 indoor dataset. In particular, DevNet reduces the root-mean-square deviation of depth estimation by around 4% on both KITTI-2015 and NYU-V2. Code is available at https://github.com/gitkaichenzhou/DevNet.
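To make the "integrate density along rays" step concrete, below is a minimal PyTorch sketch of one plausible way to render a depth map from per-plane occlusion densities, assuming a standard front-to-back volume-rendering formulation (alpha compositing over frustum slabs). The function name, tensor shapes, and variable names are illustrative assumptions, not taken from the DevNet codebase.

```python
import torch

def render_depth_from_density(sigma: torch.Tensor, depths: torch.Tensor) -> torch.Tensor:
    """Integrate per-plane densities along each ray into a depth map.

    sigma:  (B, N, H, W) non-negative occlusion probability densities,
            one value per plane (N parallel planes slicing the frustum).
    depths: (N,) depth of each plane, sorted front to back.
    Returns a (B, H, W) expected depth map.
    """
    # Spacing between consecutive planes; the last interval is repeated
    # so every plane has an associated slab thickness.
    deltas = depths[1:] - depths[:-1]                    # (N-1,)
    deltas = torch.cat([deltas, deltas[-1:]])            # (N,)
    deltas = deltas.view(1, -1, 1, 1)

    # Opacity of each slab, then transmittance accumulated front-to-back
    # (probability the ray survives all slabs in front of the current one).
    alpha = 1.0 - torch.exp(-sigma * deltas)             # (B, N, H, W)
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)

    # Per-plane compositing weights; expected depth is their weighted mean.
    weights = alpha * trans                              # (B, N, H, W)
    return (weights * depths.view(1, -1, 1, 1)).sum(dim=1)
```

Under this formulation the weights sum to at most one per ray, so the rendered depth is a convex combination of the plane depths; a sharply peaked density therefore yields a crisp depth, while a diffuse density signals uncertainty along the ray.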