Fast and flexible 3D scene reconstruction from unstructured image collections remains a significant challenge. We present YoNoSplat, a feedforward model that reconstructs high-quality 3D Gaussian Splatting representations from an arbitrary number of images. Our model is highly versatile, operating effectively with posed or unposed, calibrated or uncalibrated inputs. YoNoSplat predicts local Gaussians and a camera pose for each view, and these are aggregated into a global representation using either the predicted or the provided poses. To overcome the inherent difficulty of jointly learning 3D Gaussians and camera parameters, we introduce a novel mixing training strategy. This approach mitigates the entanglement between the two tasks by initially using ground-truth poses to aggregate the local Gaussians and gradually transitioning to a mix of predicted and ground-truth poses, which avoids both training instability and exposure bias. We further resolve the scale ambiguity problem with a novel pairwise camera-distance normalization scheme and by embedding camera intrinsics into the network. Moreover, YoNoSplat predicts intrinsic parameters itself, making it applicable to uncalibrated inputs. YoNoSplat demonstrates exceptional efficiency, reconstructing a scene from 100 views (at 280×518 resolution) in just 2.69 seconds on an NVIDIA GH200 GPU. It achieves state-of-the-art performance on standard benchmarks in both pose-free and pose-dependent settings. Our project page is at https://botaoye.github.io/yonosplat/.
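The mixing training strategy can be pictured with a minimal sketch. The schedule below is a hypothetical illustration, not the paper's implementation: the ramp shape and the `warmup_frac` parameter are our own assumptions. The idea it demonstrates is the one stated above: aggregation starts from ground-truth poses only, then each view's pose is replaced by the predicted one with a probability that grows over training, so the model is eventually exposed to its own pose estimates before inference.

```python
import torch

def mix_poses(pred_poses, gt_poses, step, total_steps, warmup_frac=0.2):
    """Scheduled mixing of ground-truth and predicted poses (a sketch).

    pred_poses, gt_poses: (V, 4, 4) per-view camera-to-world matrices.
    warmup_frac is a hypothetical hyperparameter, not from the paper.
    """
    warmup = int(warmup_frac * total_steps)
    if step < warmup:
        p_pred = 0.0  # pure ground-truth phase: stable early training
    else:
        # Linearly ramp the probability of using a predicted pose to 1.
        p_pred = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    # Per-view coin flip: some views use predicted, others ground-truth poses.
    use_pred = torch.rand(pred_poses.shape[0]) < p_pred
    return torch.where(use_pred[:, None, None], pred_poses, gt_poses)
```

Training the aggregation this way means the loss gradually reflects the poses the model will actually produce at test time, which is how the strategy addresses exposure bias.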
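The pairwise camera-distance normalization can likewise be sketched as rescaling the scene so that the mean pairwise distance between camera centers is 1, which removes the global scale ambiguity between predicted Gaussians and poses. The function below is an assumed illustration under that reading; `points` is a stand-in for any scale-dependent quantities (e.g. Gaussian means) and is not a name from the paper.

```python
import torch

def normalize_by_pairwise_camera_distance(poses, points):
    """Pairwise camera-distance normalization (a sketch of the idea).

    poses: (V, 4, 4) camera-to-world matrices; points: (N, 3) scene points.
    Rescales translations and points so the mean pairwise distance
    between camera centers equals 1.
    """
    centers = poses[:, :3, 3]                            # (V, 3) camera centers
    scale = torch.pdist(centers).mean().clamp(min=1e-8)  # mean pairwise distance
    poses = poses.clone()
    poses[:, :3, 3] /= scale                             # rescale camera translations
    return poses, points / scale
```

Normalizing by a statistic of the camera layout, rather than by any single camera's position, keeps the target scale consistent across scenes with very different numbers of views and baselines, which is what makes the regression targets well-posed.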