We present Generalizable NeRF Transformer (GNT), a transformer-based architecture that reconstructs Neural Radiance Fields (NeRFs) and learns to render novel views on the fly from source views. While prior works on NeRFs optimize a scene representation by inverting a handcrafted rendering equation, GNT achieves neural representation and rendering that generalizes across scenes using transformers at two stages. (1) The view transformer leverages multi-view geometry as an inductive bias for attention-based scene representation, and predicts coordinate-aligned features by aggregating information from epipolar lines on the neighboring views. (2) The ray transformer renders novel views using attention to decode the features from the view transformer along the sampled points during ray marching. Our experiments demonstrate that when optimized on a single scene, GNT can successfully reconstruct NeRF without an explicit rendering formula, thanks to the learned ray renderer. When trained on multiple scenes, GNT consistently achieves state-of-the-art performance when transferring to unseen scenes and outperforms all other methods by ~10% on average. Our analysis of the learned attention maps for inferring depth and occlusion indicates that attention enables learning a physically-grounded rendering. Our results show the promise of transformers as a universal modeling tool for graphics. Please refer to our project page for video results: https://vita-group.github.io/GNT/.
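To make the ray-transformer idea concrete, below is a minimal sketch (not the authors' code) of attention-based ray rendering: instead of compositing colors with the handcrafted volume rendering equation, a transformer attends over the features of points sampled along each ray and decodes a pixel color. All module names, feature dimensions, and the mean-pooling readout are illustrative assumptions.

```python
# Minimal sketch, assuming coordinate-aligned per-point features have already
# been produced by a view transformer (e.g., via epipolar aggregation).
import torch
import torch.nn as nn

class RayTransformerSketch(nn.Module):
    def __init__(self, feat_dim: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_rgb = nn.Linear(feat_dim, 3)  # decode attended features to color

    def forward(self, point_feats: torch.Tensor) -> torch.Tensor:
        # point_feats: (num_rays, num_samples, feat_dim) -- features for the
        # points sampled along each ray during ray marching.
        attended = self.encoder(point_feats)       # attention across ray samples
        pooled = attended.mean(dim=1)              # simple readout (assumption)
        return torch.sigmoid(self.to_rgb(pooled))  # per-ray RGB in [0, 1]

# Usage: render 1024 rays, each with 64 samples carrying 64-d features.
rays = torch.randn(1024, 64, 64)
rgb = RayTransformerSketch()(rays)  # -> (1024, 3)
```

In this sketch the attention weights over ray samples play a role analogous to the transmittance-weighted compositing in the classical rendering equation, which is consistent with the abstract's observation that the learned attention maps correlate with depth and occlusion.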