In this paper, we present AA-RMVSNet, a novel recurrent multi-view stereo network based on long short-term memory (LSTM) with adaptive aggregation. We first introduce an intra-view aggregation module that adaptively extracts image features using context-aware convolution and multi-scale aggregation, which efficiently improves performance on challenging regions such as thin objects and large low-textured surfaces. To overcome the difficulty of varying occlusion in complex scenes, we propose an inter-view cost volume aggregation module for adaptive pixel-wise view aggregation, which preserves better-matched pairs among all views. The two proposed adaptive aggregation modules are lightweight, effective, and complementary in improving the accuracy and completeness of 3D reconstruction. Instead of conventional 3D CNNs, we employ a hybrid network with a recurrent structure for cost volume regularization, which enables high-resolution reconstruction and a finer hypothetical plane sweep. The proposed network is trained end-to-end and achieves excellent performance on various datasets: it ranks $1^{st}$ among all submissions on the Tanks and Temples benchmark and achieves competitive results on the DTU dataset, exhibiting strong generalizability and robustness. An implementation of our method is available at https://github.com/QT-Zhu/AA-RMVSNet.
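The idea behind pixel-wise adaptive view aggregation can be sketched as follows. This is a simplified illustration under our own assumptions, not the paper's actual learned module: where the network would predict per-pixel, per-view weights, we simulate them with a softmax over negative matching costs, so that for each pixel the better-matched views (lower cost) dominate the aggregated cost volume slice.

```python
import numpy as np

def aggregate_views(costs):
    """Hypothetical pixel-wise adaptive view aggregation (illustrative only).

    costs: array of shape (V, H, W) holding matching costs from V source
    views at one depth hypothesis; lower cost means a better match.
    Returns an (H, W) aggregated cost map where, per pixel, views with
    lower cost receive higher weight (a stand-in for learned weights).
    """
    w = np.exp(-costs)                    # favor well-matched (low-cost) views
    w = w / w.sum(axis=0, keepdims=True)  # per-pixel softmax over the V views
    return (w * costs).sum(axis=0)        # weighted sum -> (H, W)

rng = np.random.default_rng(0)
costs = rng.random((4, 8, 8))             # 4 views, 8x8 cost maps
agg = aggregate_views(costs)
print(agg.shape)
```

Because the weights decrease with cost, the aggregated value at each pixel is never larger than the plain per-pixel mean over views, which is the intended effect: occluded or poorly matched views are down-weighted rather than averaged in uniformly.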