Recent feed-forward reconstruction models such as VGGT and $\pi^3$ achieve impressive reconstruction quality but cannot process streaming videos due to their quadratic memory complexity, which limits practical deployment. Existing streaming methods address this with learned memory mechanisms or causal attention, but they require extensive retraining and may not fully exploit the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity-transformation ($\mathrm{Sim}(3)$) alignment fails because of layer depth misalignment: monocular scale ambiguity causes the relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes a scale factor per layer, and propagates these factors across both adjacent windows and timestamps. Extensive experiments show that LASER achieves state-of-the-art camera pose estimation and point map reconstruction, on par with offline models, while running at 14 FPS with 6 GB of peak memory on an RTX A6000 GPU, enabling practical deployment on kilometer-scale streaming videos. Project website: \href{https://neu-vi.github.io/LASER/}{\texttt{https://neu-vi.github.io/LASER/}}
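To make the window-to-window alignment concrete, the sketch below illustrates the layer-wise scale idea on a single overlap frame. It is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function name \texttt{layerwise\_scale\_align}, the quantile-based layer segmentation, the layer count \texttt{n\_layers}, and the median-ratio scale estimate are all hypothetical choices, and the full method additionally propagates the per-layer scales across adjacent windows and timestamps.

\begin{verbatim}
import numpy as np

def layerwise_scale_align(depth_prev, depth_curr, n_layers=4):
    """Rescale depth_curr, layer by layer, onto depth_prev's scale.

    depth_prev, depth_curr: (H, W) depth maps predicted for the same
    overlap frame by two consecutive windows. A single Sim(3) scale
    cannot reconcile them, because monocular scale ambiguity shifts
    each scene layer's relative depth differently between windows.
    """
    # Segment the current prediction into discrete depth layers
    # (quantile bins here; an illustrative segmentation rule).
    edges = np.quantile(depth_curr, np.linspace(0.0, 1.0, n_layers + 1))
    aligned = depth_curr.copy()
    scales = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (depth_curr >= lo) & (depth_curr <= hi)
        if not mask.any():
            scales.append(1.0)
            continue
        # One robust scale factor per layer: the median depth ratio
        # over the pixels the two windows share.
        s = float(np.median(depth_prev[mask] / depth_curr[mask]))
        aligned[mask] *= s
        scales.append(s)
    # These per-layer scales would then be propagated to the
    # non-overlap frames of the current window (and across
    # timestamps) to keep the streaming reconstruction consistent.
    return aligned, scales
\end{verbatim}

The median ratio is used here because it keeps each per-layer estimate robust to outlier pixels near layer boundaries; a least-squares fit per layer would serve the same illustrative purpose.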