A classical problem in computer vision is to infer a 3D scene representation from few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g. textured meshes, or implicit representations, e.g. radiance fields, and often requires input images with precise camera poses and long processing times for each novel scene. In this work, we propose the Scene Representation Transformer (SRT), a method that processes posed or unposed RGB images of a new area, infers a "set-latent scene representation", and synthesizes novel views, all in a single feed-forward pass. To calculate the scene representation, we propose a generalization of the Vision Transformer to sets of images, enabling global information integration, and hence 3D reasoning. An efficient decoder transformer parameterizes the light field by attending into the scene representation to render novel views. Learning is supervised end-to-end by minimizing a novel-view reconstruction error. We show that this method outperforms recent baselines in terms of PSNR and speed on synthetic datasets, including a new dataset created for the paper. Further, we demonstrate that SRT scales to support interactive visualization and semantic segmentation of real-world outdoor environments using Street View imagery.
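The core mechanism described above, a decoder that renders a novel view by cross-attending into a set-latent scene representation, can be illustrated with a toy numpy sketch. All weights here are random and the token and ray dimensions are made up for illustration; this is not the paper's actual architecture, only a minimal instance of "query a set of latent tokens with an encoded ray to predict a color":

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(q, kv, wq, wk, wv):
    # Single-head cross-attention: the ray query attends into the latent tokens.
    Q, K, V = q @ wq, kv @ wk, kv @ wv
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V

d = 16  # hypothetical latent dimension

# Stand-in for the encoder output: a set-latent scene representation, i.e.
# one token per image patch across 3 input images of 8x8 patches each.
tokens = rng.normal(size=(3 * 8 * 8, d))

# A novel-view ray (origin + direction, 6 numbers) projected into the latent space.
ray = rng.normal(size=(1, 6)) @ rng.normal(size=(6, d))

wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
w_rgb = rng.normal(size=(d, 3))

feat = cross_attention(ray, tokens, wq, wk, wv)  # attend into the scene representation
rgb = 1.0 / (1.0 + np.exp(-feat @ w_rgb))        # predicted color, squashed to [0, 1]
print(rgb.shape)  # one RGB triple per queried ray
```

Because each rendered pixel is an independent query against the same token set, this decoding step parallelizes trivially across rays, which is what makes feed-forward, interactive-rate rendering plausible.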