The Scene Representation Transformer (SRT) is a recent method to render novel views at interactive rates. Since SRT uses camera poses with respect to an arbitrarily chosen reference camera, it is not invariant to the order of the input views. As a result, SRT is not directly applicable to large-scale scenes where the reference frame would need to be changed regularly. In this work, we propose Relative Pose Attention SRT (RePAST): Instead of fixing a reference frame at the input, we inject pairwise relative camera pose information directly into the attention mechanism of the Transformers. This leads to a model that is by definition invariant to the choice of any global reference frame, while still retaining the full capabilities of the original method. Empirical results show that adding this invariance to the model does not lead to a loss in quality. We believe that this is a step towards applying fully latent transformer-based rendering methods to large-scale scenes.
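To make the key property concrete, here is a minimal, hypothetical sketch of attention with a pairwise relative-pose bias. It is not the paper's actual architecture (function names, the flattened 16-dim pose encoding, and the scalar bias projection `w_pose` are all illustrative assumptions); it only demonstrates why conditioning on relative poses `pose_i⁻¹ · pose_j` makes the output invariant to any global reference-frame change, since `(G·P_i)⁻¹(G·P_j) = P_i⁻¹P_j`.

```python
import numpy as np

def relative_pose_features(poses):
    """poses: (n, 4, 4) camera-to-world matrices.
    Returns (n, n, 16) flattened pairwise relative poses rel[i, j] = P_i^{-1} @ P_j."""
    n = poses.shape[0]
    inv = np.linalg.inv(poses)
    rel = np.einsum('iab,jbc->ijac', inv, poses)
    return rel.reshape(n, n, 16)

def relative_pose_attention(q, k, v, poses, w_pose):
    """q, k, v: (n, d) token features, one token per camera for simplicity.
    w_pose: (16,) hypothetical projection of the relative pose to a scalar bias."""
    bias = relative_pose_features(poses) @ w_pose        # (n, n) pairwise pose bias
    logits = q @ k.T / np.sqrt(q.shape[1]) + bias        # pose enters the attention logits
    logits -= logits.max(axis=1, keepdims=True)          # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ v

# Invariance check: transforming all poses by one global matrix G leaves the output unchanged,
# because G cancels inside P_i^{-1} G^{-1} G P_j.
rng = np.random.default_rng(0)
n, d = 4, 8
q, k, v = rng.normal(size=(3, n, d))
poses = np.eye(4) + 0.1 * rng.normal(size=(n, 4, 4))     # invertible stand-ins for camera poses
w_pose = rng.normal(size=16)
out_a = relative_pose_attention(q, k, v, poses, w_pose)
G = np.eye(4) + 0.1 * rng.normal(size=(4, 4))            # arbitrary global reference-frame change
out_b = relative_pose_attention(q, k, v, np.einsum('ab,ibc->iac', G, poses), w_pose)
assert np.allclose(out_a, out_b, atol=1e-6)
```

By contrast, a model fed absolute poses in some chosen reference frame would see different inputs under `G` and generally produce different outputs, which is exactly the dependence RePAST removes.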