Photorealistic 3-D reconstruction from monocular video collapses in large-scale scenes when depth, pose, and radiance are solved in isolation: scale-ambiguous depth yields ghost geometry, long-horizon pose drift corrupts alignment, and a single global NeRF cannot model hundreds of metres of content. We introduce a joint learning framework that couples all three factors and addresses each of these failure modes. Our system begins with a Vision Transformer (ViT) depth network trained with metric-scale supervision, producing globally consistent depth despite wide variations in field of view. A multi-scale feature bundle-adjustment (BA) layer refines camera poses directly in feature space, relying on learned pyramidal descriptors rather than brittle keypoints, to suppress drift on unconstrained trajectories. For scene representation, we deploy an incremental hierarchy of local radiance fields: new hash-grid NeRFs are allocated on the fly, and completed ones frozen, whenever view overlap falls below a threshold, enabling city-block-scale coverage on a single GPU. Evaluated on the Tanks and Temples benchmark, our method reduces Absolute Trajectory Error to 0.001-0.021 m across eight indoor and outdoor sequences (up to 18x lower than BARF and 2x lower than NoPe-NeRF) while maintaining sub-pixel Relative Pose Error. These results demonstrate that metric-scale, drift-free 3-D reconstruction and high-fidelity novel-view synthesis are achievable from a single uncalibrated RGB camera.
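To make the local-field allocation rule concrete, the following is a minimal Python sketch of the incremental policy described above, not the authors' implementation: the names HashGridNeRF, LocalFieldHierarchy, and estimate_overlap, the distance-based overlap proxy, and the 0.3 threshold are all illustrative assumptions.

```python
# Hypothetical sketch of incremental local radiance-field allocation:
# freeze the active hash-grid NeRF and spawn a new one once view overlap
# with it drops below a threshold. All names and values are assumptions.
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class HashGridNeRF:
    """Placeholder for one local hash-grid radiance field."""
    anchor_position: np.ndarray   # camera centre at which this field was allocated
    frozen: bool = False          # frozen fields receive no further gradient updates

    def freeze(self) -> None:
        self.frozen = True


def estimate_overlap(cam_position: np.ndarray, anchor_position: np.ndarray,
                     radius: float = 10.0) -> float:
    """Toy proxy for view overlap in [0, 1]: decays linearly with camera-centre
    distance. A real system would measure covisible frustum/depth coverage."""
    d = float(np.linalg.norm(cam_position - anchor_position))
    return max(0.0, 1.0 - d / radius)


@dataclass
class LocalFieldHierarchy:
    overlap_threshold: float = 0.3   # assumed value; the abstract states a threshold, not its magnitude
    fields: List[HashGridNeRF] = field(default_factory=list)

    def active(self) -> HashGridNeRF:
        return self.fields[-1]

    def process_frame(self, cam_position: np.ndarray) -> HashGridNeRF:
        """Return the local field that should absorb the incoming frame,
        spawning a new one when overlap with the active field is too small."""
        if not self.fields:
            self.fields.append(HashGridNeRF(anchor_position=cam_position))
        elif estimate_overlap(cam_position, self.active().anchor_position) < self.overlap_threshold:
            self.active().freeze()                                           # retire the current block
            self.fields.append(HashGridNeRF(anchor_position=cam_position))   # start a fresh local NeRF
        return self.active()


# Usage: feed camera centres from the pose estimator; the hierarchy decides
# when to open a new local field.
if __name__ == "__main__":
    hierarchy = LocalFieldHierarchy()
    for t in range(40):
        position = np.array([0.5 * t, 0.0, 0.0])   # synthetic straight-line trajectory
        hierarchy.process_frame(position)
    print(f"allocated {len(hierarchy.fields)} local fields")
```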