Despite impressive performance on high-level downstream tasks, self-supervised pre-training methods have not yet fully delivered on dense geometric vision tasks such as stereo matching. The application of self-supervised learning concepts, such as instance discrimination or masked image modeling, to geometric tasks is an active area of research. In this work we build on the recent cross-view completion framework: this variation of masked image modeling leverages a second view of the same scene, which makes it well suited for binocular downstream tasks. However, the applicability of this concept has so far been limited in at least two ways: (a) by the difficulty of collecting real-world image pairs (in practice, only synthetic data have been used) and (b) by the lack of generalization of vanilla transformers to dense downstream tasks, for which relative position is more meaningful than absolute position. We explore three avenues of improvement. First, we introduce a method to collect suitable real-world image pairs at large scale. Second, we experiment with relative positional embeddings and demonstrate that they enable vision transformers to perform substantially better. Third, we scale up vision-transformer-based cross-view completion architectures, which is made possible by the use of large amounts of data. With these improvements, we show for the first time that state-of-the-art results on deep stereo matching can be reached without using any standard task-specific techniques such as correlation volumes, iterative estimation, or multi-scale reasoning.
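To illustrate why relative position can matter more than absolute position for dense matching, the following minimal sketch uses a rotary-style embedding. This choice is an assumption made for illustration only: the abstract does not state which relative positional scheme is used. The helper names `rotate` and `score`, the toy vectors, and the frequency base are all hypothetical. The sketch shows that the attention logit between a query and a key depends only on their positional offset, not on where the pair sits in the sequence.

```python
import math


def rotate(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by angles proportional to `pos`.

    Illustrative rotary-style embedding (an assumption, not the paper's stated method).
    """
    dim = len(vec)
    assert dim % 2 == 0, "rotary-style embedding needs an even dimension"
    out = []
    for i in range(0, dim, 2):
        # Each pair of channels gets its own rotation frequency.
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def score(q, k, q_pos, k_pos):
    """Dot-product attention logit between a rotated query and a rotated key."""
    rq = rotate(q, q_pos)
    rk = rotate(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))


# Toy query/key vectors (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

near = score(q, k, q_pos=2, k_pos=5)       # offset of 3
shifted = score(q, k, q_pos=12, k_pos=15)  # same offset, different absolute positions
other = score(q, k, q_pos=2, k_pos=9)      # different offset

print("same offset gives same logit:", math.isclose(near, shifted, abs_tol=1e-9))
print("different offset gives same logit:", math.isclose(near, other, abs_tol=1e-9))
```

Under this scheme, shifting both tokens by the same amount leaves the logit unchanged. That is the kind of translation invariance one would expect to help when a model pre-trained at one resolution or crop is applied to dense tasks at another.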