Depth estimation from a stereo image pair is one of the most widely explored problems in computer vision, and most previous methods rely on fully supervised learning. However, accurate ground truth data is difficult to acquire at scale, which makes training fully supervised methods challenging. Self-supervised methods are becoming a popular alternative. In this paper, we introduce H-Net, a deep-learning framework for unsupervised stereo depth estimation that leverages epipolar geometry to refine stereo matching. For the first time, a Siamese autoencoder architecture is used for depth estimation, which allows mutual information between the rectified stereo images to be extracted. To enforce the epipolar constraint, we design a mutual epipolar attention mechanism that places more emphasis on correspondences between features lying on the same epipolar line while learning mutual information between the input stereo pair. Stereo correspondences are further enhanced by incorporating semantic information into the proposed attention mechanism. More specifically, the optimal transport algorithm is used to suppress attention and eliminate outliers in areas not visible in both cameras. Extensive experiments on KITTI2015 and Cityscapes show that our method outperforms state-of-the-art unsupervised stereo depth estimation methods while closing the gap with fully supervised approaches.
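To make the epipolar constraint concrete: in a rectified stereo pair, epipolar lines coincide with image rows, so attention can be restricted to features in the same row of the other view. The following is a minimal NumPy sketch of that idea only, not the H-Net implementation. The function name `epipolar_cross_attention`, the `(H, W, C)` feature layout, and the scaled dot-product scoring are assumptions made for illustration; the semantic information and the optimal transport step described above are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def epipolar_cross_attention(feat_left, feat_right):
    """Row-restricted cross attention for a rectified stereo pair (illustrative).

    feat_left, feat_right: arrays of shape (H, W, C) holding per-pixel features.
    Returns:
        attended: (H, W, C) right-view features aggregated for each left pixel.
        weights:  (H, W, W) attention of left pixel (y, i) over right pixels (y, j).
    """
    _, _, c = feat_left.shape
    # scores[y, i, j] = <left[y, i], right[y, j]> / sqrt(C).
    # Only pixels in the same row y (the same epipolar line) are compared.
    scores = np.einsum("yic,yjc->yij", feat_left, feat_right) / np.sqrt(c)
    weights = softmax(scores, axis=-1)
    # Weighted sum of right-view features along each row.
    attended = np.einsum("yij,yjc->yic", weights, feat_right)
    return attended, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    left = rng.standard_normal((4, 6, 8))   # hypothetical feature maps
    right = rng.standard_normal((4, 6, 8))
    attended, weights = epipolar_cross_attention(left, right)
    print(attended.shape, weights.shape)
```

Because the scores are computed per row, the attention cost grows with H·W² rather than (H·W)², and a feature can never be matched to a location off its epipolar line. Applying the same function with the two arguments swapped gives the right-to-left direction, which is one plausible reading of "mutual" attention.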