Vision-Language-Action (VLA) models empower robots to understand and execute tasks described by natural language instructions. However, a key challenge lies in their ability to generalize beyond the specific environments and conditions they were trained on, which is presently difficult and expensive to evaluate in the real world. To address this gap, we present REALM, a new simulation environment and benchmark designed to evaluate the generalization capabilities of VLA models, with a specific emphasis on establishing a strong correlation between simulated and real-world performance through high-fidelity visuals and aligned robot control. Our environment offers a suite of 15 perturbation factors, 7 manipulation skills, and more than 3,500 objects. Finally, we establish two task sets that form our benchmark and evaluate the π_{0}, π_{0}-FAST, and GR00T N1.5 VLA models, showing that generalization and robustness remain an open challenge. More broadly, we also show that simulation provides a valuable proxy for the real world and allows us to systematically probe for and quantify the weaknesses and failure modes of VLAs. Project page: https://martin-sedlacek.com/realm