Photometric differences are widely used as supervision signals for training neural networks to estimate depth and camera pose from unlabeled monocular videos. However, this supervision can be detrimental to model optimization because occlusions and moving objects in a scene violate the underlying static-scene assumption. In addition, pixels in textureless regions and other less discriminative pixels hinder model training. To address these problems, in this paper we handle moving objects and occlusions by exploiting the differences between the flow fields and depth structures generated by affine transformation and by view synthesis, respectively. Second, we mitigate the effect of textureless regions on model optimization by measuring differences between features that carry richer semantic and contextual information, without adding extra networks. Moreover, although a bidirectional component is used in each sub-objective function, each image pair is reasoned about only once, which helps reduce overhead. Extensive experiments and visual analyses demonstrate the effectiveness of the proposed method, which outperforms existing state-of-the-art self-supervised methods under the same conditions and without introducing additional auxiliary information.
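The photometric supervision signal discussed above can be sketched as a per-pixel intensity difference between the target frame and a view synthesized from a neighboring frame using the predicted depth and pose. The sketch below is a minimal illustration of this idea, not the paper's actual objective (which additionally handles occlusions, moving objects, and textureless regions); the function name and the optional validity mask are illustrative assumptions.

```python
import numpy as np

def photometric_difference(target, synthesized, valid_mask=None):
    """Mean absolute per-pixel difference between the target frame and the
    view synthesized from a neighboring frame (a simplified sketch of the
    photometric loss used in self-supervised depth/pose training)."""
    diff = np.abs(target.astype(np.float64) - synthesized.astype(np.float64))
    if valid_mask is not None:
        # An (assumed) binary mask excluding unreliable pixels, e.g. those
        # affected by occlusions or moving objects.
        diff = diff * valid_mask
        return diff.sum() / max(valid_mask.sum(), 1)
    return diff.mean()

# Toy usage with random "images": identical frames give zero difference.
rng = np.random.default_rng(0)
tgt = rng.random((4, 4))
print(photometric_difference(tgt, tgt.copy()))          # -> 0.0
print(photometric_difference(tgt, tgt + 0.5) > 0.0)     # -> True
```

In practice such a loss is typically combined with structural terms (e.g. SSIM) and smoothness regularization, but the masked L1 form above captures the core supervision signal the abstract refers to.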