3D vision foundation models such as the Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However, VGGT is time-consuming and memory-intensive on long sequences, limiting its application to large-scale scenes beyond hundreds of images. To address this, we propose LiteVGGT, which achieves up to a 10x speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: (1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; (2) token similarity across adjacent network layers remains stable, allowing merge decisions to be reused. Guided by these insights, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging. We analyze each token's geometric importance, optimizing anchor-token selection to better preserve information that is key for reconstruction. We also cache and reuse merge indices across layers, substantially reducing latency with minimal impact on accuracy. This strategy retains VGGT's core performance and enables efficient fine-tuning and FP8 quantization for further gains. Extensive experiments validate LiteVGGT's effectiveness, scalability, and robustness. Project page: https://garlicba.github.io/LiteVGGT/
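To make the cached-token-merging idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation): anchor tokens are selected by a stand-in importance score, the remaining tokens are assigned to their most similar anchor and averaged in, and the resulting merge indices are returned so a later layer can reuse them instead of recomputing similarities. The function name, the importance heuristic (feature norm in place of the paper's geometric-importance criterion), and the cache layout are all illustrative assumptions.

```python
import numpy as np

def merge_tokens(tokens, keep_ratio=0.5, cached_idx=None):
    """Sketch of cached token merging (illustrative, not the paper's method).

    tokens:     (n, d) array of token features for one layer.
    keep_ratio: fraction of tokens kept as anchors.
    cached_idx: merge indices from an earlier layer; if given, the
                similarity computation is skipped entirely.
    """
    n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    if cached_idx is None:
        # Stand-in "importance" score: feature norm (the paper's
        # geometric-importance criterion is not reproduced here).
        order = np.argsort(-np.linalg.norm(tokens, axis=1))
        anchors = np.sort(order[:k])   # kept tokens
        others = np.sort(order[k:])    # tokens to be merged away
        # Cosine similarity of each non-anchor token to every anchor.
        a = tokens[anchors] / np.linalg.norm(tokens[anchors], axis=1, keepdims=True)
        b = tokens[others] / np.linalg.norm(tokens[others], axis=1, keepdims=True)
        assign = np.argmax(b @ a.T, axis=1)       # nearest anchor per token
        cached_idx = (anchors, others, assign)    # reusable across layers
    anchors, others, assign = cached_idx
    # Merge by averaging each group (anchor plus its assigned tokens).
    merged = tokens[anchors].copy()
    counts = np.ones(len(anchors))
    for t, a_i in zip(others, assign):
        merged[a_i] += tokens[t]
        counts[a_i] += 1
    merged /= counts[:, None]
    return merged, cached_idx
```

Because adjacent layers have stable token similarity (insight 2 above), the `cached_idx` returned by one layer can be passed to the next call, so the `argsort`/`argmax` matching cost is paid once rather than per layer.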