We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.
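To make the dual hierarchy concrete, here is a toy NumPy sketch, not the paper's implementation: all function names, token counts, and the grouping schedule are hypothetical. It only illustrates how a fine-to-coarse intra-view step (pooling spatial tokens) can alternate with a local-to-global inter-view step (merging views into larger groups) so that the token count per group stays bounded while the model's perspective widens to the full scene.

```python
import numpy as np

rng = np.random.default_rng(0)

def pool_tokens(tokens, factor=2):
    # Fine-to-coarse (intra-view): average-pool to halve the token count,
    # trading spatial detail for compact, information-dense tokens.
    n, d = tokens.shape
    return tokens.reshape(n // factor, factor, d).mean(axis=1)

def merge_groups(groups, group_size=2):
    # Local-to-global (inter-view): concatenate the tokens of neighbouring
    # groups so attention at the next stage spans a broader set of views.
    return [np.concatenate(groups[i:i + group_size], axis=0)
            for i in range(0, len(groups), group_size)]

# Hypothetical starting point: 8 views, each holding 64 fine spatial
# tokens of dimension 16.
groups = [rng.normal(size=(64, 16)) for _ in range(8)]

while len(groups) > 1:
    groups = [pool_tokens(g) for g in groups]   # tokens per group: 64 -> 32
    groups = merge_groups(groups)               # groups: 8 -> 4 -> 2 -> 1
    print(len(groups), groups[0].shape)

# A single token set now covers the whole scene, yet each stage attended
# over at most 64 tokens per group -- the source of the efficiency claim.
```

Because pooling halves the tokens exactly when merging doubles the group size, the per-group cost of any attention layer placed between these steps stays constant across stages, which is the intuition behind the claimed scalability to hundreds of input views.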