Feature representation learning is the key recipe for learning-based Multi-View Stereo (MVS). As the common feature extractor of learning-based MVS, vanilla Feature Pyramid Networks (FPNs) produce weak feature representations for reflective and texture-less areas, which limits the generalization of MVS. Even FPNs paired with pre-trained Convolutional Neural Networks (CNNs) fail to resolve these issues. On the other hand, Vision Transformers (ViTs) have achieved prominent success in many 2D vision tasks, so we ask whether ViTs can facilitate feature learning in MVS. In this paper, we propose a pre-trained ViT-enhanced MVS network called MVSFormer, which learns more reliable feature representations by benefiting from informative ViT priors. The fine-tuned MVSFormer, built on hierarchical ViTs with efficient attention mechanisms, achieves substantial improvements over FPN-based baselines. We further propose an alternative MVSFormer with frozen ViT weights, which largely reduces the training cost while remaining competitive, strengthened by the attention maps from self-distillation pre-training. MVSFormer generalizes to various input resolutions through efficient multi-scale training with gradient accumulation. Moreover, we discuss the merits and drawbacks of classification- and regression-based MVS methods, and propose to unify them with a temperature-based strategy. MVSFormer achieves state-of-the-art performance on the DTU dataset; in particular, it ranks first on both the intermediate and advanced sets of the highly competitive Tanks-and-Temples leaderboard.
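To make the temperature-based unification concrete, the following is a minimal sketch, not the authors' released code: the function name temperature_depth, the tensor shapes, the depth range, and the tau value are illustrative assumptions. The idea is that the per-hypothesis logits can be supervised with cross-entropy as in classification, while depth is read out as a temperature-sharpened expectation over the hypotheses: a small tau approaches the hard argmax of classification, and tau = 1 recovers the plain soft-argmax of regression.

    import torch
    import torch.nn.functional as F

    def temperature_depth(logits, depth_hypotheses, tau=0.1):
        # logits: (B, D, H, W) per-hypothesis scores from the cost volume.
        # depth_hypotheses: (D,) candidate depth values.
        # tau: softmax temperature; tau -> 0 approaches hard argmax
        # (classification), tau = 1 is the plain soft-argmax (regression).
        prob = F.softmax(logits / tau, dim=1)        # sharpened distribution over D
        hyp = depth_hypotheses.view(1, -1, 1, 1)     # broadcast over B, H, W
        return (prob * hyp).sum(dim=1)               # expected depth, shape (B, H, W)

    # Usage on a toy 2x2 map with 8 depth hypotheses (assumed depth range).
    logits = torch.randn(1, 8, 2, 2)
    hyps = torch.linspace(425.0, 935.0, 8)
    depth = temperature_depth(logits, hyps, tau=0.1)

The expectation keeps the sub-hypothesis accuracy of regression, while the low temperature suppresses multi-modal probability mass that would otherwise drag the estimate between peaks.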
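The multi-scale training with gradient accumulation can likewise be sketched as below. This is a hypothetical toy setup rather than MVSFormer's actual training loop: the model, loss, and scale schedule are stand-ins. Each scale runs its own forward/backward pass and gradients are accumulated before a single optimizer step, so peak memory stays at that of a single-scale pass while the update averages over resolutions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Conv2d(3, 1, 3, padding=1)      # toy stand-in for the MVS network
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scales = [0.5, 0.75, 1.0]                  # assumed multi-scale schedule

    images = torch.randn(2, 3, 256, 256)       # dummy inputs and targets
    target = torch.randn(2, 1, 256, 256)

    optimizer.zero_grad()
    for s in scales:
        x = F.interpolate(images, scale_factor=s,
                          mode="bilinear", align_corners=False)
        y = F.interpolate(target, scale_factor=s,
                          mode="bilinear", align_corners=False)
        # Divide by the number of scales so the accumulated gradient is the
        # average over scales; a single optimizer step follows all scales.
        loss = F.l1_loss(model(x), y) / len(scales)
        loss.backward()
    optimizer.step()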