Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance and observational robustness. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
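The multi-frame post-training idea, where each frame is encoded once by the backbone and historical per-frame features are then aggregated via feature chunking rather than re-fed into the VLM, can be illustrated with a minimal sketch. All names, the window length, and the mean-pooling aggregation below are assumptions for illustration only, not CronusVLA's actual architecture:

```python
from collections import deque

import numpy as np


class FeatureChunkBuffer:
    """Conceptual sketch of multi-frame aggregation via feature chunking.

    Hypothetical illustration: the class name, window size, and the
    mean-pooling aggregation are assumptions, not the paper's design.
    The key property it demonstrates is that each frame costs only one
    backbone forward pass; history is reused as cached features.
    """

    def __init__(self, history_len: int = 4):
        # Sliding window of cached per-frame features; old frames drop out
        # automatically, so cost stays constant regardless of episode length.
        self.buffer = deque(maxlen=history_len)

    def step(self, frame_feature) -> np.ndarray:
        # Cache the current frame's feature (produced by a single VLM
        # forward pass elsewhere), then aggregate the chunk of historical
        # features into one conditioning vector for the action head.
        self.buffer.append(np.asarray(frame_feature, dtype=np.float64))
        return np.stack(self.buffer).mean(axis=0)


buf = FeatureChunkBuffer(history_len=3)
out1 = buf.step([1.0, 1.0])  # only one frame so far -> its own feature
out2 = buf.step([3.0, 3.0])  # mean of the two cached frame features
```

Because aggregation happens over cached features instead of raw frames, the per-step cost is one backbone pass plus a cheap pooling, which is the efficiency argument the abstract makes against feeding multiple frames directly into the VLM.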