Vision-Language-Action (VLA) models have emerged as a powerful framework that unifies perception, language, and control, enabling robots to perform diverse tasks through multimodal understanding. However, current VLA models typically contain a massive number of parameters and rely heavily on large-scale pretraining with robot data, leading to high computational costs during training and limited deployability for real-time inference. Moreover, prevailing training paradigms often degrade the perceptual representations of the vision-language backbone, causing overfitting and poor generalization to downstream tasks. In this work, we present Evo-1, a lightweight VLA model that reduces computation and improves deployment efficiency while maintaining strong performance, without pretraining on robot data. Evo-1 builds on a native multimodal vision-language model (VLM) and incorporates a novel cross-modulated diffusion transformer together with an optimized integration module, which jointly form an effective architecture. We further introduce a two-stage training paradigm that progressively aligns action with perception, preserving the representations of the VLM. Notably, with only 0.77 billion parameters, Evo-1 achieves state-of-the-art results on the Meta-World and RoboTwin suites, surpassing the previous best models by 12.4% and 6.9%, respectively, and also attains a competitive 94.8% on LIBERO. In real-world evaluations, Evo-1 reaches a 78% success rate with high inference frequency and low memory overhead, outperforming all baseline methods. We release code, data, and model weights to facilitate future research on lightweight and efficient VLA models.
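To make the architectural idea concrete, below is a minimal sketch of what a cross-modulated diffusion-transformer block could look like, assuming AdaLN-style modulation (as popularized by DiT) driven by pooled VLM features, combined with cross-attention over the VLM token sequence. The class name, dimensions, and conditioning scheme are illustrative assumptions for exposition, not Evo-1's actual design.

```python
# Hypothetical sketch: a cross-modulated DiT block conditioned on VLM features.
# Assumes AdaLN-style shift/scale/gate modulation plus cross-attention;
# all names and sizes are illustrative, not taken from the Evo-1 codebase.
import torch
import torch.nn as nn

class CrossModulatedDiTBlock(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Pooled VLM features produce per-branch shift/scale/gate parameters
        # (three branches x three parameters each = 9 * dim outputs).
        self.modulation = nn.Linear(dim, 9 * dim)

    def forward(self, x: torch.Tensor, vlm_tokens: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) noisy action tokens; vlm_tokens: (B, L, dim) perception tokens.
        cond = vlm_tokens.mean(dim=1)  # pooled conditioning vector, (B, dim)
        s1, b1, g1, s2, b2, g2, s3, b3, g3 = self.modulation(cond).chunk(9, dim=-1)
        # Self-attention branch, modulated by the VLM conditioning vector.
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.self_attn(h, h, h)[0]
        # Cross-attention branch: action tokens attend to VLM perception tokens.
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.cross_attn(h, vlm_tokens, vlm_tokens)[0]
        # Feed-forward branch with the same modulation pattern.
        h = self.norm3(x) * (1 + s3.unsqueeze(1)) + b3.unsqueeze(1)
        return x + g3.unsqueeze(1) * self.mlp(h)

# Usage with illustrative shapes: 16 action tokens attending to 64 VLM tokens.
block = CrossModulatedDiTBlock()
actions = torch.randn(2, 16, 512)
context = torch.randn(2, 64, 512)
out = block(actions, context)  # -> (2, 16, 512)
```

Under this reading, "cross-modulated" means the VLM conditions the action expert through two channels at once: global feature statistics gate every sub-layer, while token-level cross-attention injects fine-grained perceptual detail.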