Inferring the structure of 3D scenes from 2D observations is a fundamental challenge in computer vision. Recently popularized approaches based on neural scene representations have achieved tremendous impact and have been applied across a variety of applications. One of the major remaining challenges in this space is training a single model which can provide latent representations which effectively generalize beyond a single scene. Scene Representation Transformer (SRT) has shown promise in this direction, but scaling it to a larger set of diverse scenes is challenging and necessitates accurately posed ground truth data. To address this problem, we propose RUST (Really Unposed Scene representation Transformer), a pose-free approach to novel view synthesis trained on RGB images alone. Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent pose embedding which is used by the decoder for view synthesis. We perform an empirical investigation into the learned latent pose structure and show that it allows meaningful test-time camera transformations and accurate explicit pose readouts. Perhaps surprisingly, RUST achieves similar quality as methods which have access to perfect camera pose, thereby unlocking the potential for large-scale training of amortized neural scene representations.
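To make the abstract's central idea concrete, the following is a minimal, hypothetical sketch (not the authors' code) of how a Pose Encoder can peek at the target image, produce a latent pose embedding, and condition a decoder that renders pixels from a set-latent representation of unposed input views. All module names (SceneEncoder, PoseEncoder, Decoder), layer sizes, and the simplified CNN/attention stacks are illustrative assumptions; the paper's model is transformer-based throughout, and its pose encoder additionally attends into the scene representation and sees only part of the target view.

```python
# Minimal sketch of the RUST idea in PyTorch; sizes and components are assumptions.
import torch
import torch.nn as nn

class SceneEncoder(nn.Module):
    # Maps a set of *unposed* input RGB views to a set-latent scene representation.
    def __init__(self, d=256):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, d, 8, stride=8), nn.ReLU())
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, views):                        # (B, V, 3, H, W)
        B, V, _, _, _ = views.shape
        tokens = self.cnn(views.flatten(0, 1))       # (B*V, d, H/8, W/8)
        tokens = tokens.flatten(2).transpose(1, 2)   # (B*V, N, d)
        tokens = tokens.reshape(B, -1, tokens.shape[-1])
        return self.transformer(tokens)              # (B, V*N, d) scene latents

class PoseEncoder(nn.Module):
    # Peeks at the target view and regresses a low-dimensional latent pose.
    def __init__(self, d=256, pose_dim=8):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, d, 8, stride=8), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(d, pose_dim)

    def forward(self, target_view):                  # (B, 3, H, W)
        return self.head(self.cnn(target_view))      # (B, pose_dim) latent pose

class Decoder(nn.Module):
    # Predicts a pixel colour for a query conditioned on latent pose + scene latents.
    def __init__(self, d=256, pose_dim=8):
        super().__init__()
        self.query_proj = nn.Linear(pose_dim + 2, d)  # latent pose + pixel (u, v)
        self.cross_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.rgb_head = nn.Linear(d, 3)

    def forward(self, latent_pose, pixel_uv, scene_latents):
        q = self.query_proj(torch.cat([latent_pose, pixel_uv], -1)).unsqueeze(1)
        out, _ = self.cross_attn(q, scene_latents, scene_latents)
        return self.rgb_head(out.squeeze(1))          # predicted RGB

# Training supervision is an RGB reconstruction loss on the target pixels, so no
# ground-truth camera poses are required at any point.
```

At test time, the same latent pose space can be probed directly, e.g. by encoding one view, perturbing the latent pose embedding, and decoding the resulting novel views, which is what enables the camera manipulations and pose readouts described in the abstract.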