Novel view synthesis is a long-standing problem. In this work, we consider a variant of the problem where we are given only a few context views sparsely covering a scene or an object. The goal is to synthesize images of the scene from novel viewpoints, which requires learning priors. The current state of the art is based on Neural Radiance Fields (NeRF), and while these methods achieve impressive results, they suffer from long training times as they require evaluating millions of 3D point samples via a neural network for each image. We propose a 2D-only method that maps multiple context views and a query pose to a new image in a single pass of a neural network. Our model uses a two-stage architecture consisting of a codebook and a transformer model. The codebook is used to embed individual images into a smaller latent space, and the transformer solves the view synthesis task in this more compact space. To train our model efficiently, we introduce a novel branching attention mechanism that allows us to use the same model not only for neural rendering but also for camera pose estimation. Experimental results on real-world scenes show that our approach is competitive with NeRF-based methods while not reasoning explicitly in 3D, and it is faster to train.
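To make the two-stage design concrete, below is a minimal PyTorch sketch of the pipeline described above: a codebook encoder that quantizes each context view into discrete latent tokens, and a transformer that consumes those tokens together with an embedded query pose. All layer sizes, the class name TwoStageViewSynthesis, and the nearest-neighbour quantization are illustrative assumptions, not the paper's implementation; the branching attention mechanism, positional embeddings, and the decoder that maps predicted tokens back to pixels are omitted for brevity.

```python
import torch
import torch.nn as nn

class TwoStageViewSynthesis(nn.Module):
    """Illustrative sketch (not the paper's code) of a two-stage model:
    Stage 1 embeds each image into a short sequence of discrete codes;
    Stage 2 runs a transformer over all context-view codes plus a
    query-pose token and predicts codes for the novel view."""

    def __init__(self, num_codes=512, code_dim=256, pose_dim=12):
        super().__init__()
        # Stage 1: CNN encoder followed by a learned codebook (assumed sizes).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=4), nn.ReLU(),
            nn.Conv2d(64, code_dim, 4, stride=4),
        )
        self.codebook = nn.Embedding(num_codes, code_dim)
        # The query camera pose becomes one extra token.
        self.pose_embed = nn.Linear(pose_dim, code_dim)
        # Stage 2: transformer operating in the compact latent space.
        layer = nn.TransformerEncoderLayer(code_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=6)
        self.to_logits = nn.Linear(code_dim, num_codes)

    def quantize(self, images):
        # images: (B, V, 3, H, W) with H, W divisible by 16.
        b, v = images.shape[:2]
        feats = self.encoder(images.flatten(0, 1))        # (B*V, D, h, w)
        feats = feats.flatten(2).transpose(1, 2)          # (B*V, T, D)
        # Nearest codebook entry per spatial feature.
        dists = torch.cdist(feats, self.codebook.weight)  # (B*V, T, K)
        return dists.argmin(-1).reshape(b, -1)            # (B, V*T)

    def forward(self, context_images, query_pose):
        tokens = self.codebook(self.quantize(context_images))  # (B, N, D)
        pose_tok = self.pose_embed(query_pose).unsqueeze(1)    # (B, 1, D)
        h = self.transformer(torch.cat([tokens, pose_tok], 1))
        return self.to_logits(h)  # per-token logits over codebook entries

# Example usage with hypothetical shapes: 2 scenes, 3 context views each,
# and a flattened 3x4 camera matrix as the query pose.
model = TwoStageViewSynthesis()
logits = model(torch.randn(2, 3, 3, 128, 128), torch.randn(2, 12))
```

In the full model, the predicted token logits would be decoded back to pixels by the codebook's decoder; this sketch stops at the token level to highlight that the transformer never reasons about 3D points, only about compact 2D latent codes.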