This paper proposes MVSTR, a Transformer-based network for Multi-View Stereo (MVS). MVSTR extracts dense features with global context and 3D consistency, both of which are crucial for reliable matching in MVS. Specifically, to address the limited receptive field of existing CNN-based MVS methods, a global-context Transformer module is first proposed to explore intra-view global context. In addition, to make the dense features 3D-consistent, a 3D-geometry Transformer module is built with a well-designed cross-view attention mechanism that facilitates inter-view information interaction. Experimental results show that the proposed MVSTR achieves the best overall performance on the DTU dataset and strong generalization on the Tanks & Temples benchmark dataset.
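To make the distinction between intra-view and inter-view attention concrete, the following is a minimal sketch of generic scaled dot-product attention applied in both ways. It is not the MVSTR implementation: the function names `attend`, `intra_view`, and `cross_view` are hypothetical, learned projections and positional encodings are omitted, and the paper's specific cross-view attention design is not reproduced here. Each view is assumed to be a list of per-pixel feature vectors.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Generic scaled dot-product attention (no learned projections).

    Assumption: this stands in for a Transformer attention layer; it is
    not the attention mechanism designed in the paper.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def intra_view(feats):
    """Self-attention: every pixel of a view attends to the whole view."""
    return attend(feats, feats, feats)


def cross_view(ref_feats, src_feats):
    """Cross-attention: reference-view pixels attend to a source view."""
    return attend(ref_feats, src_feats, src_feats)


# Toy features: 2 reference pixels and 3 source pixels, 2 channels each.
ref = [[1.0, 0.0], [0.0, 1.0]]
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

ref_global = intra_view(ref)             # intra-view global context
ref_fused = cross_view(ref_global, src)  # inter-view information interaction
```

In this sketch, the intra-view step gives every pixel access to the full image regardless of distance, which is the property a limited CNN receptive field lacks, while the cross-view step lets reference-view features incorporate information from another view.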