Although neural radiance fields (NeRF) have shown impressive advances for novel view synthesis, most methods typically require multiple input images of the same scene with accurate camera poses. In this work, we seek to substantially reduce the inputs to a single unposed image. Existing approaches condition on local image features to reconstruct a 3D object, but often render blurry predictions at viewpoints that are far away from the source view. To address this issue, we propose to leverage both the global and local features to form an expressive 3D representation. The global features are learned from a vision transformer, while the local features are extracted from a 2D convolutional network. To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering. This novel 3D representation allows the network to reconstruct unseen regions without enforcing constraints like symmetry or canonical coordinate systems. Our method can render novel views from only a single input image and generalize across multiple object categories using a single model. Quantitative and qualitative evaluations demonstrate that the proposed method achieves state-of-the-art performance and renders richer details than existing approaches.
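To make the described conditioning concrete, below is a minimal PyTorch sketch of the general idea: a transformer branch produces a global image feature, a small 2D CNN produces local features sampled at each 3D point's image projection, and an MLP conditioned on both predicts density and color for volume rendering. The module names (GlobalEncoder, LocalEncoder, ConditionedNeRF), feature sizes, and the assumption that projected pixel coordinates are available are illustrative only, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalEncoder(nn.Module):
    """ViT-style branch: patch embedding + transformer, mean-pooled to one global vector."""
    def __init__(self, patch=16, dim=256, depth=4, heads=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, img):                        # img: (B, 3, H, W)
        tokens = self.patch_embed(img).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.transformer(tokens).mean(dim=1)               # (B, dim)


class LocalEncoder(nn.Module):
    """2D CNN feature map; local features are bilinearly sampled at projected point locations."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, padding=1), nn.ReLU(),
        )

    def forward(self, img, uv):                    # uv: (B, P, 2) in [-1, 1]
        feat = self.net(img)                       # (B, dim, H, W)
        samp = F.grid_sample(feat, uv.unsqueeze(2), align_corners=True)
        return samp.squeeze(-1).transpose(1, 2)    # (B, P, dim)


class ConditionedNeRF(nn.Module):
    """MLP mapping a 3D point plus global/local conditioning to (density, rgb)."""
    def __init__(self, g_dim=256, l_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + g_dim + l_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                  # density + rgb
        )

    def forward(self, pts, g_feat, l_feat):        # pts: (B, P, 3)
        g = g_feat.unsqueeze(1).expand(-1, pts.shape[1], -1)
        out = self.mlp(torch.cat([pts, g, l_feat], dim=-1))
        sigma, rgb = F.relu(out[..., :1]), torch.sigmoid(out[..., 1:])
        return sigma, rgb


# Toy forward pass: one image, 1024 query points with (assumed known) projected coordinates.
img = torch.randn(1, 3, 128, 128)
pts = torch.randn(1, 1024, 3)
uv = torch.rand(1, 1024, 2) * 2 - 1
g_enc, l_enc, nerf = GlobalEncoder(), LocalEncoder(), ConditionedNeRF()
sigma, rgb = nerf(pts, g_enc(img), l_enc(img, uv))
print(sigma.shape, rgb.shape)                      # (1, 1024, 1) and (1, 1024, 3)
```

The predicted per-point densities and colors would then be composited along each camera ray with standard volume rendering weights; positional encoding of the query points and view directions, omitted here for brevity, would typically be applied before the MLP.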