We introduce a novel framework for training deep stereo networks effortlessly and without any ground truth. By leveraging state-of-the-art neural rendering solutions, we generate stereo training data from image sequences collected with a single handheld camera. On top of these data, we carry out a NeRF-supervised training procedure in which rendered stereo triplets compensate for occlusions and rendered depth maps serve as proxy labels. The resulting stereo networks predict sharp and detailed disparity maps. Experimental results show that models trained under this regime yield a 30-40% improvement over existing self-supervised methods on the challenging Middlebury dataset, closing the gap to supervised models and, in most cases, outperforming them at zero-shot generalization.
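To make the training signal described above concrete, the following is a minimal sketch of how a NeRF-supervised objective might combine a photometric term (left view reconstructed from the right via the predicted disparity, masked at occlusions) with an L1 term against rendered depth/disparity proxy labels. This is an illustrative assumption, not the paper's actual loss: the function names (`warp_right_to_left`, `ns_loss`), the single-channel images, the integer warping, and the weighting `alpha` are all hypothetical simplifications.

```python
import numpy as np

def warp_right_to_left(right, disparity):
    # Hypothetical helper: reconstruct the left view by sampling the
    # right image at horizontal position x - d(x) (nearest-neighbor).
    h, w = right.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    src = np.clip(xs - disparity, 0, w - 1).astype(int)
    return np.take_along_axis(right, src, axis=1)

def ns_loss(left, right, pred_disp, proxy_disp, occlusion_mask, alpha=0.5):
    # Photometric term: compare the left image with the right image
    # warped by the predicted disparity, ignoring occluded pixels.
    recon = warp_right_to_left(right, pred_disp)
    photo = np.abs(left - recon) * occlusion_mask
    # Proxy term: L1 distance to the rendered disparity labels.
    proxy = np.abs(pred_disp - proxy_disp)
    return alpha * photo.mean() + (1 - alpha) * proxy.mean()
```

A perfect prediction on a synthetic constant-disparity pair drives both terms to zero, while any disparity error increases the loss; in the actual framework, triplets (three rendered viewpoints) let the photometric term be evaluated on whichever side of the target view is unoccluded.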