Most existing Vision-Language-Action (VLA) models rely primarily on RGB information, ignoring geometric cues that are crucial for spatial reasoning and manipulation. In this work, we introduce GLaD, a geometry-aware VLA framework that incorporates 3D geometric priors during pretraining through knowledge distillation. Rather than distilling geometric features solely into the vision encoder, we align the LLM's hidden states corresponding to visual tokens with features from a frozen geometry-aware vision transformer (VGGT), ensuring that geometric understanding is deeply integrated into the multimodal representations that drive action prediction. Pretrained on the Bridge dataset with this geometry distillation mechanism, GLaD achieves a 94.1% average success rate across four LIBERO task suites, outperforming UniVLA (92.5%), which uses identical pretraining data. These results validate that geometry-aware pretraining enhances spatial reasoning and policy generalization without requiring explicit depth sensors or 3D annotations.
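To make the distillation mechanism concrete, the sketch below illustrates one way the alignment could be implemented: LLM hidden states at visual-token positions are projected into the geometry encoder's feature space and matched against frozen VGGT features. The projection-head design and the negative-cosine-similarity objective are assumptions for illustration; the abstract does not specify the exact loss form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeometryDistillationLoss(nn.Module):
    """Minimal sketch: align LLM hidden states at visual-token positions with
    frozen geometry-encoder (e.g., VGGT) patch features.

    The two-layer projection head and cosine objective are assumed choices,
    not the paper's confirmed implementation.
    """

    def __init__(self, llm_dim: int, geo_dim: int):
        super().__init__()
        # Learned head mapping the LLM hidden size to the geometry feature size.
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, geo_dim),
            nn.GELU(),
            nn.Linear(geo_dim, geo_dim),
        )

    def forward(
        self,
        llm_hidden: torch.Tensor,   # (B, L, llm_dim) hidden states from the LLM backbone
        visual_mask: torch.Tensor,  # (B, L) bool, True at visual-token positions (same count per sample)
        geo_feats: torch.Tensor,    # (B, N, geo_dim) frozen geometry-encoder features
    ) -> torch.Tensor:
        B = llm_hidden.size(0)
        # Gather hidden states at visual-token positions and reshape to (B, N, llm_dim).
        vis_hidden = llm_hidden[visual_mask].view(B, -1, llm_hidden.size(-1))
        pred = self.proj(vis_hidden)        # (B, N, geo_dim)
        target = geo_feats.detach()         # teacher features stay frozen
        # Negative cosine similarity per token, averaged over tokens and batch.
        return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
```

During pretraining, this auxiliary loss would be added to the action-prediction objective with a weighting coefficient, so the geometric supervision shapes the multimodal representations without requiring depth sensors or 3D annotations at inference time.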