GeoAware-VLA: Implicit Geometry Aware Vision-Language-Action Model

Vision-Language-Action (VLA) models often fail to generalize to novel camera viewpoints, a limitation that stems from their difficulty in inferring robust 3D geometry from 2D images. We introduce GeoAware-VLA, a simple yet effective approach that improves viewpoint invariance by integrating strong geometric priors into the vision backbone. Instead of training a visual encoder or relying on explicit 3D data, we use a frozen, pretrained geometric vision model as a feature extractor. A trainable projection layer then adapts these geometrically rich features for the policy decoder, relieving it of the burden of learning 3D consistency from scratch. Through extensive evaluations on subsets of the LIBERO benchmark, we show that GeoAware-VLA achieves substantial improvements in zero-shot generalization to novel camera poses, more than doubling success rates in simulation. Crucially, these benefits carry over to the physical world: our model shows a significant performance gain on a real robot, especially when evaluated from unseen camera angles. Our approach is effective across both continuous and discrete action spaces, which suggests that robust geometric grounding is a key component of more generalizable robotic agents.

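The frozen-encoder-plus-trainable-projection pattern described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class and function names are hypothetical, a fixed random linear map stands in for the pretrained geometric vision model, and the policy decoder is omitted, so the projection output is compared directly against a target action. The only point it shows is that gradient updates touch the projection while the encoder is read but never modified.

```python
import random


class FrozenGeometricEncoder:
    """Hypothetical stand-in for a pretrained geometric vision model.

    Its weights are fixed at construction and never updated.
    """

    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
                        for _ in range(feat_dim)]

    def __call__(self, pixels):
        # Linear map from flattened pixels to a feature vector.
        return [sum(w * x for w, x in zip(row, pixels)) for row in self.weights]


class TrainableProjection:
    """Linear layer that adapts frozen features; the only trained part here."""

    def __init__(self, feat_dim, out_dim):
        self.weights = [[0.0] * feat_dim for _ in range(out_dim)]

    def __call__(self, features):
        return [sum(w * f for w, f in zip(row, features)) for row in self.weights]

    def sgd_step(self, features, target, lr):
        """Take one gradient step on 0.5 * squared error.

        Returns the loss measured before the update.
        """
        errors = [p - t for p, t in zip(self(features), target)]
        for row, err in zip(self.weights, errors):
            for j, f in enumerate(features):
                row[j] -= lr * err * f
        return 0.5 * sum(e * e for e in errors)


def train(encoder, projection, pixels, target_action, steps=200, lr=0.1):
    """Fit the projection to one (image, action) pair; the encoder is only read."""
    features = encoder(pixels)  # computed once, since the encoder is frozen
    return [projection.sgd_step(features, target_action, lr) for _ in range(steps)]


if __name__ == "__main__":
    encoder = FrozenGeometricEncoder(in_dim=4, feat_dim=3)
    projection = TrainableProjection(feat_dim=3, out_dim=2)
    losses = train(encoder, projection, [0.1, 0.2, 0.3, 0.4], [0.5, -0.25])
    print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

In a real system the encoder would be a large pretrained network and the projection would feed a learned policy decoder. Because the encoder is frozen, its features can be computed once per image and reused across training steps, as `train` does here.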