The self-attention mechanism, successfully employed with the transformer structure, has shown promise in many computer vision tasks, including image recognition and object detection. Despite this surge, the use of the transformer for the problem of stereo matching remains relatively unexplored. In this paper, we comprehensively investigate the use of the transformer for stereo matching, especially for laparoscopic videos, and propose a new hybrid deep stereo matching framework (HybridStereoNet) that combines the best of the CNN and the transformer in a unified design. Specifically, we investigate several ways to introduce transformers into volumetric stereo matching pipelines by analyzing the loss landscapes of the designs and their in-domain/cross-domain accuracy. Our analysis suggests that employing transformers for feature representation learning, while using CNNs for cost aggregation, leads to faster convergence, higher accuracy, and better generalization than the other options. Our extensive experiments on the Sceneflow, SCARED2019, and dVPN datasets demonstrate the superior performance of our HybridStereoNet.
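To make the hybrid design concrete, the following is a minimal PyTorch sketch of the idea the abstract describes: a transformer encoder for feature representation learning, a CNN (3D convolutions) for cost aggregation over a volumetric cost volume, and soft-argmin disparity regression. All module names, layer counts, and hyperparameters here are illustrative assumptions, not the authors' actual HybridStereoNet configuration.

```python
# Hypothetical sketch of a transformer-feature / CNN-aggregation stereo pipeline.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerFeatureExtractor(nn.Module):
    """Patch-embed an image and refine the tokens with a transformer encoder."""
    def __init__(self, in_ch=3, dim=64, patch=4, depth=4, heads=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        f = self.patch_embed(x)                     # B x C x H/4 x W/4
        b, c, h, w = f.shape
        tokens = self.encoder(f.flatten(2).transpose(1, 2))  # B x (HW) x C
        return tokens.transpose(1, 2).view(b, c, h, w)


class CNNCostAggregation(nn.Module):
    """Aggregate the 4D cost volume with a small stack of 3D convolutions."""
    def __init__(self, in_ch, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(hidden, 1, 3, padding=1),
        )

    def forward(self, volume):
        return self.net(volume).squeeze(1)          # B x D x H x W


class HybridStereoSketch(nn.Module):
    def __init__(self, max_disp=48, dim=64):
        super().__init__()
        self.max_disp = max_disp // 4               # disparity range at 1/4 resolution
        self.features = TransformerFeatureExtractor(dim=dim)
        self.aggregate = CNNCostAggregation(in_ch=2 * dim)

    def build_cost_volume(self, fl, fr):
        # Concatenation-based cost volume over candidate disparities.
        b, c, h, w = fl.shape
        volume = fl.new_zeros(b, 2 * c, self.max_disp, h, w)
        for d in range(self.max_disp):
            if d == 0:
                volume[:, :c, d] = fl
                volume[:, c:, d] = fr
            else:
                volume[:, :c, d, :, d:] = fl[:, :, :, d:]
                volume[:, c:, d, :, d:] = fr[:, :, :, :-d]
        return volume

    def forward(self, left, right):
        fl, fr = self.features(left), self.features(right)
        cost = self.aggregate(self.build_cost_volume(fl, fr))   # B x D x H x W
        prob = F.softmax(-cost, dim=1)                           # soft argmin over disparity
        disp = torch.arange(self.max_disp, device=cost.device, dtype=prob.dtype)
        disp = (prob * disp.view(1, -1, 1, 1)).sum(dim=1)
        # Scale disparity values and upsample back to full resolution.
        return F.interpolate(disp.unsqueeze(1) * 4, scale_factor=4,
                             mode='bilinear', align_corners=False)


if __name__ == "__main__":
    net = HybridStereoSketch()
    left = torch.randn(1, 3, 64, 128)
    right = torch.randn(1, 3, 64, 128)
    print(net(left, right).shape)   # torch.Size([1, 1, 64, 128])
```

The design choice mirrors the abstract's finding: global self-attention is used only where long-range context helps most (feature learning), while the regular, local structure of cost-volume filtering is left to convolutions.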