Novel view synthesis (NVS) seeks to render photorealistic, 3D-consistent images of a scene from unseen camera poses given only a sparse set of posed views. Existing deterministic networks render observed regions quickly but blur unobserved areas, whereas stochastic diffusion-based methods hallucinate plausible content yet incur heavy training- and inference-time costs. In this paper, we propose a hybrid framework that unifies the strengths of both paradigms. A bidirectional transformer encodes multi-view image tokens and Plücker-ray embeddings, producing a shared latent representation. Two lightweight heads then operate on this representation: (i) a feed-forward regression head that renders pixels where geometry is well constrained, and (ii) a masked autoregressive diffusion head that completes occluded or unseen regions. The entire model is trained end-to-end with joint photometric and diffusion losses, without handcrafted 3D inductive biases, enabling scalability across diverse scenes. Experiments demonstrate that our method attains state-of-the-art image quality while reducing rendering time by an order of magnitude compared with fully generative baselines.
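To make the two-head design concrete, the sketch below shows one way the shared-latent architecture and joint objective could be wired up in PyTorch. It is a minimal illustration, not the paper's implementation: the patch size, token width, module names, MSE photometric loss, and the linear noising schedule in the diffusion term are all assumptions introduced here for readability.

```python
# Minimal sketch of the hybrid NVS model described above (hypothetical shapes,
# names, and noising schedule; the abstract does not specify these details).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridNVS(nn.Module):
    def __init__(self, patch_dim=3 * 16 * 16, ray_dim=6, d_model=512,
                 n_layers=8, n_heads=8):
        super().__init__()
        # Embed image patches and Plücker-ray parameters into a shared token width.
        self.patch_embed = nn.Linear(patch_dim, d_model)
        self.ray_embed = nn.Linear(ray_dim, d_model)
        # Bidirectional (non-causal) transformer over all multi-view tokens.
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # (i) Feed-forward regression head: latent token -> RGB patch.
        self.regress_head = nn.Linear(d_model, patch_dim)
        # (ii) Diffusion head: predicts noise for a masked target patch,
        # conditioned on the latent token and the diffusion timestep.
        self.diff_head = nn.Sequential(
            nn.Linear(d_model + patch_dim + 1, d_model), nn.GELU(),
            nn.Linear(d_model, patch_dim))

    def forward(self, patches, rays):
        # patches: (B, N, patch_dim), rays: (B, N, ray_dim)
        tokens = self.patch_embed(patches) + self.ray_embed(rays)
        return self.encoder(tokens)  # shared latent representation (B, N, d_model)

    def loss(self, patches, rays, target_patches, mask):
        # mask: (B, N) boolean, True where the target region is occluded/unseen.
        z = self.forward(patches, rays)
        # Photometric loss on well-constrained (unmasked) tokens.
        pred = self.regress_head(z)
        photo = F.mse_loss(pred[~mask], target_patches[~mask])
        # Simplified noise-prediction diffusion loss on masked tokens.
        t = torch.rand(z.shape[0], 1, 1, device=z.device)     # timestep in [0, 1]
        noise = torch.randn_like(target_patches)
        noisy = (1 - t) * target_patches + t * noise           # illustrative linear schedule
        eps_hat = self.diff_head(
            torch.cat([z, noisy, t.expand(-1, z.shape[1], 1)], dim=-1))
        diff = F.mse_loss(eps_hat[mask], noise[mask])
        # Joint end-to-end objective over both heads.
        return photo + diff
```

The split mirrors the claimed speed advantage: the cheap regression head handles the well-observed regions in a single forward pass, and the iterative diffusion sampling is only ever needed for the masked, unobserved regions.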