Stereo depth estimation relies on optimal correspondence matching between pixels on epipolar lines in the left and right images to infer depth. In this work, we revisit the problem from a sequence-to-sequence correspondence perspective, replacing cost volume construction with dense pixel matching that uses position information and attention. This approach, named STereo TRansformer (STTR), has several advantages: it 1) relaxes the limitation of a fixed disparity range, 2) identifies occluded regions and provides confidence estimates, and 3) imposes uniqueness constraints during the matching process. We report promising results on both synthetic and real-world datasets and demonstrate that STTR generalizes across different domains, even without fine-tuning.
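To make the idea of matching along an epipolar line without a cost volume more concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the STTR implementation: the function name `match_epipolar_row` is hypothetical, the per-pixel descriptors are assumed to be given, and a plain softmax over dot-product scores stands in for the learned attention. The sketch leaves out position information, and it does not impose the uniqueness constraint or detect occlusions.

```python
import numpy as np

def match_epipolar_row(feat_left, feat_right):
    """Match pixels of one rectified image row (hypothetical helper).

    feat_left, feat_right: arrays of shape (W, C), one descriptor per pixel.
    Returns an integer disparity and a confidence in (0, 1] for each left pixel.
    """
    width, channels = feat_left.shape
    # Scaled dot-product similarity between every left and every right pixel.
    scores = feat_left @ feat_right.T / np.sqrt(channels)
    # In rectified stereo a left pixel x_l can only match a right pixel x_r <= x_l.
    cols = np.arange(width)
    valid = cols[None, :] <= cols[:, None]
    scores = np.where(valid, scores, -np.inf)
    # Softmax over the right pixels. Every position in the row is a candidate,
    # so no maximum disparity has to be chosen in advance.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # The strongest match gives the disparity; its weight is a crude confidence.
    best = weights.argmax(axis=1)
    disparity = cols - best
    confidence = weights.max(axis=1)
    return disparity, confidence
```

Calling the function on descriptors where the left row is the right row shifted by a constant recovers that shift as the disparity wherever a valid match exists. The returned confidence is only a stand-in for the confidence estimates mentioned in the abstract: pixels with no true match still receive a disparity here, so low confidence is at best a rough hint of occlusion.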