Novel view synthesis from monocular videos of dynamic scenes with unknown camera poses remains a fundamental challenge in computer vision and graphics. While recent advances in 3D representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown promising results for static scenes, they struggle with dynamic content and typically rely on pre-computed camera poses. We present 4D3R, a pose-free dynamic neural rendering framework that decouples static and dynamic components through a two-stage approach. Our method first leverages 3D foundation models for initial pose and geometry estimation, followed by motion-aware refinement. 4D3R introduces two key technical innovations: (1) a motion-aware bundle adjustment (MA-BA) module that combines transformer-based learned priors with SAM2 for robust dynamic object segmentation, enabling more accurate camera pose refinement; and (2) an efficient Motion-Aware Gaussian Splatting (MA-GS) representation that uses control points with a deformation field MLP and linear blend skinning to model dynamic motion, significantly reducing computational cost while maintaining high-quality reconstruction. Extensive experiments on real-world dynamic datasets demonstrate that our approach achieves up to 1.8 dB PSNR improvement over state-of-the-art methods, particularly in challenging scenarios with large dynamic objects, while reducing computational requirements by 5× compared to previous dynamic scene representations.
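As an illustration of the MA-GS idea named above (control points, a deformation-field MLP, and linear blend skinning acting on Gaussians), the following is a minimal PyTorch sketch of how such pieces could fit together; it is not the paper's implementation. The time-only MLP input, the axis-angle parameterization, the RBF-softmax skinning weights, and all sizes (64 control points, 10k Gaussian centers) are assumptions made for the example.

```python
# Minimal sketch (assumptions, not the authors' code): deform Gaussian centers
# over time via control points, a deformation MLP, and linear blend skinning.
import torch
import torch.nn as nn


class DeformationMLP(nn.Module):
    """Predicts an axis-angle rotation and a translation per control point at time t."""

    def __init__(self, num_ctrl: int, hidden: int = 128):
        super().__init__()
        self.num_ctrl = num_ctrl
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_ctrl * 6),  # 3 axis-angle + 3 translation per control point
        )

    def forward(self, t: torch.Tensor):
        out = self.net(t.view(1, 1)).view(self.num_ctrl, 6)
        return out[:, :3], out[:, 3:]  # (K, 3) axis-angle, (K, 3) translation


def axis_angle_to_matrix(aa: torch.Tensor) -> torch.Tensor:
    """Rodrigues' formula: (K, 3) axis-angle vectors -> (K, 3, 3) rotation matrices."""
    theta = aa.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = aa / theta
    skew = torch.zeros(aa.shape[0], 3, 3, device=aa.device)
    skew[:, 0, 1], skew[:, 0, 2] = -k[:, 2], k[:, 1]
    skew[:, 1, 0], skew[:, 1, 2] = k[:, 2], -k[:, 0]
    skew[:, 2, 0], skew[:, 2, 1] = -k[:, 1], k[:, 0]
    eye = torch.eye(3, device=aa.device).expand_as(skew)
    s, c = torch.sin(theta)[..., None], torch.cos(theta)[..., None]
    return eye + s * skew + (1 - c) * (skew @ skew)


def lbs_deform(xyz, ctrl_pts, rot_aa, trans, sigma=0.1):
    """Linear blend skinning: each Gaussian center follows a weighted blend of the
    rigid transforms attached to nearby control points.
    xyz: (N, 3) centers, ctrl_pts: (K, 3), rot_aa / trans: (K, 3)."""
    d2 = torch.cdist(xyz, ctrl_pts) ** 2                       # (N, K) squared distances
    w = torch.softmax(-d2 / (2 * sigma ** 2), dim=-1)          # RBF skinning weights
    R = axis_angle_to_matrix(rot_aa)                           # (K, 3, 3)
    local = xyz[:, None, :] - ctrl_pts[None, :, :]             # (N, K, 3) offsets to each control point
    moved = torch.einsum('kij,nkj->nki', R, local) + ctrl_pts + trans  # rigidly move by every control point
    return (w[..., None] * moved).sum(dim=1)                   # (N, 3) blended deformed centers


if __name__ == "__main__":
    # Usage: deform 10k Gaussian centers with 64 control points at time t = 0.5.
    torch.manual_seed(0)
    xyz = torch.rand(10_000, 3)
    ctrl = torch.rand(64, 3)
    mlp = DeformationMLP(num_ctrl=64)
    rot_aa, trans = mlp(torch.tensor([0.5]))
    deformed = lbs_deform(xyz, ctrl, rot_aa, trans)
    print(deformed.shape)  # torch.Size([10000, 3])
```

The efficiency argument in the abstract follows from this structure: the MLP only has to predict K control-point transforms per timestep (K is small), while the much larger set of Gaussians is deformed by cheap weighted blending rather than per-Gaussian network queries.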