Foundation models applied in robotics, particularly \textbf{Vision--Language--Action (VLA)} models, hold great promise for achieving general-purpose manipulation. Yet systematic real-world evaluations and cross-model comparisons remain scarce. This paper reports our \textbf{empirical experiences} from benchmarking four representative VLAs -- \textbf{ACT}, \textbf{OpenVLA--OFT}, \textbf{RDT-1B}, and $\boldsymbol{\pi_0}$ -- across four manipulation tasks conducted both in simulation and on the \textbf{ALOHA Mobile} platform. We establish a \textbf{standardized evaluation framework} that measures performance along three key dimensions: (1) \textit{accuracy and efficiency} (success rate and time-to-success), (2) \textit{adaptability} across in-distribution, spatial out-of-distribution, and instance-plus-spatial out-of-distribution settings, and (3) \textit{language instruction-following accuracy}. Through this process, we observe that $\boldsymbol{\pi_0}$ demonstrates superior adaptability in out-of-distribution scenarios, while \textbf{ACT} provides the highest in-distribution stability. Further analysis highlights differences in computational demands, data-scaling behavior, and recurring failure modes such as near-miss grasps, premature releases, and long-horizon state drift. These findings reveal practical trade-offs among VLA architectures in balancing precision, generalization, and deployment cost, offering actionable insights for selecting and deploying VLAs in real-world robotic manipulation tasks.
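As a minimal illustration of how such a framework might aggregate results along the three dimensions, the Python sketch below computes per-split success rate, mean time-to-success, and instruction-following accuracy from per-episode logs. The \texttt{Episode} record and its field names are illustrative assumptions for exposition, not the framework's actual interface.
\begin{verbatim}
from dataclasses import dataclass
from statistics import mean

@dataclass
class Episode:
    # Hypothetical per-episode log record (field names assumed).
    split: str                  # "ID", "spatial-OOD", or "instance+spatial-OOD"
    success: bool               # did the rollout satisfy the task predicate?
    time_s: float               # wall-clock seconds until success, if any
    followed_instruction: bool  # acted on the commanded object/target

def summarize(episodes):
    """Aggregate episode logs into the three reported dimensions per split."""
    report = {}
    for split in {e.split for e in episodes}:
        eps = [e for e in episodes if e.split == split]
        succ = [e for e in eps if e.success]
        report[split] = {
            "success_rate": len(succ) / len(eps),
            "mean_time_to_success_s": mean(e.time_s for e in succ) if succ else None,
            "instruction_accuracy": sum(e.followed_instruction for e in eps) / len(eps),
        }
    return report
\end{verbatim}
Comparing the resulting per-split numbers across models is what surfaces the trade-offs summarized above, e.g., a model with a high in-distribution success rate but a steep drop on the OOD splits.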