Enhanced Stable View Synthesis

We introduce an approach to enhance novel view synthesis from images taken by a freely moving camera. The approach focuses on outdoor scenes, where recovering an accurate geometric scaffold and accurate camera poses is challenging, and where the state-of-the-art stable view synthesis (SVS) method therefore gives inferior results. SVS and related methods fail on outdoor scenes primarily because they (i) over-rely on multiview stereo (MVS) for geometric scaffold recovery and (ii) assume that the camera poses computed by COLMAP are the best possible estimates, even though it is well studied that MVS 3D reconstruction accuracy is limited by scene disparity and that camera-pose accuracy is sensitive to key-point correspondence selection. This work proposes a principled way to enhance novel view synthesis solutions, drawing inspiration from the basics of multiple view geometry. By leveraging the complementary behavior of MVS and monocular depth, we arrive at a better per-view scene depth for nearby and far points, respectively. Moreover, our approach jointly refines camera poses with image-based rendering via multiple rotation averaging graph optimization. The recovered scene depth and camera poses enable better view-dependent on-surface feature aggregation over the entire scene. Extensive evaluation of our approach on popular benchmark datasets, such as Tanks and Temples, shows substantial improvement in view synthesis results compared to the prior art. For instance, our method shows a 1.5 dB PSNR improvement on Tanks and Temples. Similar statistics are observed when tested on other benchmark datasets such as FVS, Mip-NeRF 360, and DTU.
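To put the reported 1.5 dB PSNR gain in perspective, the following minimal sketch applies the standard definition PSNR = 10 log10(MAX^2 / MSE) and shows what a gain of that size implies for mean squared error. The function names and the baseline MSE value are hypothetical and chosen only for illustration; they are not taken from the paper or its evaluation code.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE is multiplied when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


# Hypothetical baseline error for images scaled to [0, 1] (an assumption).
baseline_mse = 0.01
improved_mse = baseline_mse * mse_ratio_for_gain(1.5)

# The difference in PSNR between the two error levels recovers the gain.
gain = psnr(improved_mse) - psnr(baseline_mse)
print(f"MSE ratio for +1.5 dB: {mse_ratio_for_gain(1.5):.3f}")
print(f"PSNR gain: {gain:.2f} dB")
```

Under this definition, a 1.5 dB improvement corresponds to roughly a 29% reduction in mean squared error, independent of the baseline value chosen.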
