Inferring the structure of 3D scenes from 2D observations is a fundamental challenge in computer vision. Recently popularized approaches based on neural scene representations have achieved tremendous impact and have been applied across a variety of applications. One of the major remaining challenges in this space is training a single model which can provide latent representations which effectively generalize beyond a single scene. Scene Representation Transformer (SRT) has shown promise in this direction, but scaling it to a larger set of diverse scenes is challenging and necessitates accurately posed ground truth data. To address this problem, we propose RUST (Really Unposed Scene representation Transformer), a pose-free approach to novel view synthesis trained on RGB images alone. Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent pose embedding which is used by the decoder for view synthesis. We perform an empirical investigation into the learned latent pose structure and show that it allows meaningful test-time camera transformations and accurate explicit pose readouts. Perhaps surprisingly, RUST achieves similar quality as methods which have access to perfect camera pose, thereby unlocking the potential for large-scale training of amortized neural scene representations.
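The core mechanism described above, a Pose Encoder that peeks at the target image and produces a low-dimensional latent pose that the decoder consumes in place of an explicit camera, can be sketched as follows. This is a minimal illustrative sketch in PyTorch; the module names, layer sizes, latent dimensions, and the toy pixel loss are assumptions for exposition and do not reproduce the paper's actual architecture or training setup.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen only for illustration.
LATENT_DIM = 256   # width of the pooled scene latent (assumed)
POSE_DIM = 8       # low-dimensional latent pose (assumed)

class PoseEncoder(nn.Module):
    """Peeks at the target image and infers a latent pose embedding."""
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, POSE_DIM)

    def forward(self, target_rgb):
        # target_rgb: (B, 3, H, W) -> latent pose: (B, POSE_DIM)
        feats = self.cnn(target_rgb).flatten(1)
        return self.proj(feats)

class LatentPoseDecoder(nn.Module):
    """Predicts a pixel colour from the scene latent, a query pixel
    coordinate, and the latent pose (no explicit camera parameters)."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(LATENT_DIM + POSE_DIM + 2, 256), nn.ReLU(),
            nn.Linear(256, 3),
        )

    def forward(self, scene_latent, latent_pose, pixel_xy):
        # scene_latent: (B, LATENT_DIM), latent_pose: (B, POSE_DIM),
        # pixel_xy: (B, 2) normalised target-pixel coordinates
        h = torch.cat([scene_latent, latent_pose, pixel_xy], dim=-1)
        return self.mlp(h)

# Toy forward pass. In RUST the scene latent comes from an SRT-style
# encoder over the unposed input views; here it is random, purely to
# check shapes and show how the pieces connect.
B = 4
scene_latent = torch.randn(B, LATENT_DIM)
target = torch.rand(B, 3, 64, 64)
pixel_xy = torch.rand(B, 2)

latent_pose = PoseEncoder()(target)                        # pose from target peek
pred_rgb = LatentPoseDecoder()(scene_latent, latent_pose, pixel_xy)
# Illustrative reconstruction loss on one arbitrary target pixel;
# in practice the colour at the queried coordinate would be supervised.
loss = ((pred_rgb - target[:, :, 0, 0]) ** 2).mean()
```

Because the decoder only ever sees the latent pose, no ground-truth camera parameters are needed during training; novel views are obtained at test time by manipulating the latent pose directly.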