CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling

Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance and observational robustness. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
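The abstract does not spell out how feature chunking works, so the following is only a minimal sketch of one plausible reading: each incoming frame is encoded once into a feature, and the most recent features are kept in a fixed-length buffer that a downstream action decoder could consume. The names `FeatureChunker`, `toy_encoder`, and `history` are hypothetical and are not taken from the paper or its code; the toy encoder merely stands in for the vision-language backbone.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Feature = List[float]


class FeatureChunker:
    """Hypothetical helper: keeps the features of the last `history` frames."""

    def __init__(self, encode: Callable[[Sequence[float]], Feature], history: int) -> None:
        if history < 1:
            raise ValueError("history must be >= 1")
        self.encode = encode
        # A bounded deque drops the oldest feature automatically when full.
        self.buffer: Deque[Feature] = deque(maxlen=history)
        self.encoder_calls = 0

    def step(self, frame: Sequence[float]) -> List[Feature]:
        # Encode only the newest frame; older features are reused from the buffer.
        self.buffer.append(self.encode(frame))
        self.encoder_calls += 1
        return list(self.buffer)


def toy_encoder(frame: Sequence[float]) -> Feature:
    # Stand-in for the backbone (assumption): mean pixel value as a 1-D feature.
    return [sum(frame) / len(frame)]


chunker = FeatureChunker(toy_encoder, history=3)
# Five toy "frames" of two pixels each, fed one per control step.
chunks = [chunker.step([float(t), float(t + 1)]) for t in range(5)]
```

In this sketch the encoder runs once per control step regardless of the history length, which illustrates why caching per-frame features can be cheaper than re-feeding every past frame through the backbone. Whether CronusVLA manages its history in exactly this way is an assumption.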

