GigaBrain-0: A World Model-Powered Vision-Language-Action Model

GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, Peng Li, Qiuping Deng, Runqi Ouyang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yilong Li, Yiran Ding, Yuan Xu, Yun Ye, Yukun Zhou, Zhehao Dong, Zhenan Wang, Zhichao Liu, Zheng Zhu

From arXiv, https://gigabrain0.github.io/

Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, and sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearance (e.g., textures, colors), object placement, and camera viewpoint. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.
