Despite impressive performance for high-level downstream tasks, self-supervised pre-training methods have not yet fully delivered on dense geometric vision tasks such as stereo matching or optical flow. The application of self-supervised concepts, such as instance discrimination or masked image modeling, to geometric tasks is an active area of research. In this work, we build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view from the same scene, which makes it well suited for binocular downstream tasks. The applicability of this concept has so far been limited in at least two ways: (a) by the difficulty of collecting real-world image pairs -- in practice only synthetic data have been used -- and (b) by the lack of generalization of vanilla transformers to dense downstream tasks for which relative position is more meaningful than absolute position. We explore three avenues of improvement: first, we introduce a method to collect suitable real-world image pairs at large scale. Second, we experiment with relative positional embeddings and show that they enable vision transformers to perform substantially better. Third, we scale up vision-transformer-based cross-completion architectures, which is made possible by the use of large amounts of data. With these improvements, we show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques like correlation volume, iterative estimation, image warping or multi-scale reasoning, thus paving the way towards universal vision models.
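To illustrate why relative position can matter more than absolute position for dense matching, here is a minimal sketch of one relative positional scheme. The abstract does not say which embedding is used, so the rotary-style formulation below, the 1D positions, and the names `rotate` and `score` are assumptions made for illustration only. The sketch shows that the attention logit between a query and a key depends on their offset rather than on where the pair sits in the sequence.

```python
import math

# Assumption: a 1D rotary-style embedding, chosen only as an example of a
# relative positional scheme; the abstract does not name a specific method.

def rotate(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` (even length) by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # one frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def score(q, k, q_pos, k_pos):
    """Attention logit between a query at q_pos and a key at k_pos."""
    rq = rotate(q, q_pos)
    rk = rotate(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))

# Hypothetical query and key features.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset (k_pos - q_pos = 3) at two different absolute locations.
s_near = score(q, k, 2, 5)
s_far = score(q, k, 40, 43)

# A different offset (7) for comparison.
s_other = score(q, k, 2, 9)

print(s_near, s_far, s_other)
```

The first two logits agree up to floating-point error because each pair of features is rotated by an angle proportional to its position, so the dot product only sees the angle difference. A learned absolute embedding added to the tokens offers no such guarantee.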