We propose an efficient multi-view stereo (MVS) network for inferring depth values from multiple RGB images. Recent studies have shown that mapping the geometric relationships of real space into the neural network is a core issue of the MVS problem. Specifically, these methods focus on how to express the correspondence between different views by constructing a well-designed cost volume. In this paper, we propose a more complete cost volume construction approach that builds on previous experience. First, we introduce a self-attention mechanism to fully aggregate the dominant information from the input images and accurately model long-range dependencies, so as to selectively aggregate reference features. Second, we introduce group-wise correlation into feature aggregation, which greatly reduces the memory and computation burden; at the same time, it enhances the information interaction between different feature channels. With this approach, a more lightweight and efficient cost volume is constructed. Finally, we follow a coarse-to-fine strategy and refine the depth sampling range scale by scale with the help of uncertainty estimation. Combining the previous steps, we obtain the attention thin volume. Quantitative and qualitative experiments are presented to demonstrate the performance of our model.
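As a minimal illustrative sketch (not the authors' code), the group-wise correlation mentioned above can be computed as follows, assuming PyTorch feature volumes of shape (B, C, D, H, W) for the reference view and a warped source view; the tensor shapes and group count are assumptions for illustration.

```python
import torch

def groupwise_correlation(ref_feat: torch.Tensor,
                          src_feat: torch.Tensor,
                          num_groups: int = 8) -> torch.Tensor:
    """Compress a C-channel feature pair into a num_groups-channel similarity
    volume by averaging per-group inner products, so the cost volume stores
    num_groups channels instead of C."""
    b, c, d, h, w = ref_feat.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    ch_per_group = c // num_groups
    # Split channels into groups: (B, G, C/G, D, H, W)
    ref = ref_feat.view(b, num_groups, ch_per_group, d, h, w)
    src = src_feat.view(b, num_groups, ch_per_group, d, h, w)
    # Mean of element-wise products within each group -> (B, G, D, H, W)
    return (ref * src).mean(dim=2)

# Example: 32 feature channels compressed to an 8-channel similarity volume.
ref = torch.randn(1, 32, 48, 64, 80)
src = torch.randn(1, 32, 48, 64, 80)
print(groupwise_correlation(ref, src, num_groups=8).shape)  # (1, 8, 48, 64, 80)
```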
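The uncertainty-guided refinement of the depth sampling range can be sketched with a common formulation (an assumption, not necessarily the authors' exact rule): re-centre the per-pixel range on the expected depth and scale its width by the standard deviation of the predicted depth distribution.

```python
import torch

def refine_depth_range(prob_volume: torch.Tensor,
                       depth_hypotheses: torch.Tensor,
                       scale: float = 3.0):
    """prob_volume: (B, D, H, W) softmax probabilities over D depth hypotheses.
    depth_hypotheses: (B, D, H, W) sampled depth values.
    Returns per-pixel (depth_min, depth_max) for the next, finer scale."""
    # Expected depth per pixel: mu = sum_i p_i * d_i
    mu = (prob_volume * depth_hypotheses).sum(dim=1)                      # (B, H, W)
    # Variance as an uncertainty measure: sigma^2 = sum_i p_i * (d_i - mu)^2
    var = (prob_volume * (depth_hypotheses - mu.unsqueeze(1)) ** 2).sum(dim=1)
    sigma = var.clamp(min=1e-10).sqrt()
    # Narrow the sampling interval around mu as the uncertainty shrinks.
    return mu - scale * sigma, mu + scale * sigma
```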