MetaVLA: Unified Meta Co-training For Efficient Embodied Adaption

Vision-Language-Action (VLA) models show promise in embodied reasoning, yet remain far from true generalists: they often require task-specific fine-tuning and generalize poorly to unseen tasks. We propose MetaVLA, a unified, backbone-agnostic post-training framework for efficient and scalable alignment. MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse target tasks into a single fine-tuning stage while leveraging structurally diverse auxiliary tasks to improve in-domain generalization. Unlike naive multi-task supervised fine-tuning (SFT), MetaVLA integrates a lightweight meta-learning mechanism, derived from Attentive Neural Processes, that enables rapid adaptation from diverse contexts with minimal architectural change or inference overhead. On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by ~76%. These results show that scalable, low-resource post-training is achievable, paving the way toward general-purpose embodied agents. Code will be made available.
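The abstract does not spell out the meta-learning module, so the following is only a minimal sketch of the general idea behind Attentive Neural Processes: a target query attends over a set of context examples and receives a weighted summary of them. It is not MetaVLA's implementation. The function names (`softmax`, `cross_attend`), the scaled dot-product scoring, and the toy vectors are all illustrative assumptions.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, context_keys, context_values):
    # Hypothetical helper: scaled dot-product cross-attention of one
    # target query over a set of context (key, value) pairs.
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in context_keys
    ]
    weights = softmax(scores)
    dim = len(context_values[0])
    # Context summary: attention-weighted average of the context values.
    summary = [
        sum(w * value[i] for w, value in zip(weights, context_values))
        for i in range(dim)
    ]
    return weights, summary

# Toy data (assumed): three context examples with one-hot values, so the
# summary makes the attention weights directly visible.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights, summary = cross_attend([2.0, 0.0], keys, values)
print(weights)
print(summary)
```

In this toy setup the query is most similar to the first and third context keys, so those two examples dominate the summary. In a VLA setting the context set would presumably be drawn from target and auxiliary task demonstrations, but that mapping is an assumption here rather than something the abstract states.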
