Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive performance of Vision Language Models (VLMs) on 2D images to directly infer robot end-effector poses in image-frame coordinates. Unlike prior VLA models that output low-level controls, our model predicts trajectory waypoints, making it both more efficient to train and agnostic to robot embodiment. Despite its lightweight design, our next-token prediction architecture effectively learns meaningful and executable robot trajectories. We further explore the underutilized potential of incorporating depth images, inference-time techniques such as decoding strategies, and demonstration-conditioned action generation. Our model is trained on a simulated dataset and exhibits strong sim-to-real transfer. We evaluate our approach on a combination of simulated and real data, demonstrating its effectiveness on a real robotic system.
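To make the waypoint-as-token idea concrete, the minimal Python sketch below illustrates one plausible decoding loop for an autoregressive model that emits discretized image-frame (u, v) waypoint tokens, with greedy and temperature-sampled decoding as examples of inference-time strategies. It is a sketch under assumptions, not the paper's implementation: the bin count, image resolution, vocabulary layout, and the stub `next_token_logits` function are all hypothetical stand-ins for the actual VLM head.

```python
# Hypothetical sketch: decoding image-frame waypoints from an autoregressive
# model that emits discretized (u, v) coordinate tokens. Bin count, image size,
# and the stub model below are assumptions for illustration only.
import numpy as np

NUM_BINS = 256           # coordinate bins per image axis (assumed)
IMG_W, IMG_H = 640, 480  # image resolution (assumed)


def token_to_pixel(u_tok: int, v_tok: int) -> tuple[float, float]:
    """Map a pair of bin indices back to pixel coordinates (bin centers)."""
    u = (u_tok + 0.5) / NUM_BINS * IMG_W
    v = (v_tok + 0.5) / NUM_BINS * IMG_H
    return u, v


def next_token_logits(prefix: list[int]) -> np.ndarray:
    """Stand-in for the VLM head: returns logits over the coordinate-bin vocabulary."""
    rng = np.random.default_rng(len(prefix))
    return rng.normal(size=NUM_BINS)


def decode_waypoints(num_waypoints: int, temperature: float = 0.0) -> list[tuple[float, float]]:
    """Decode a waypoint sequence token by token (greedy if temperature <= 0)."""
    prefix: list[int] = []
    waypoints = []
    for _ in range(num_waypoints):
        uv_tokens = []
        for _axis in range(2):  # one token each for the u and v coordinate
            logits = next_token_logits(prefix)
            if temperature <= 0:
                tok = int(np.argmax(logits))            # greedy decoding
            else:
                probs = np.exp(logits / temperature)
                probs /= probs.sum()
                tok = int(np.random.choice(NUM_BINS, p=probs))  # sampled decoding
            prefix.append(tok)
            uv_tokens.append(tok)
        waypoints.append(token_to_pixel(*uv_tokens))
    return waypoints


if __name__ == "__main__":
    # Example: decode five image-frame waypoints with greedy decoding.
    print(decode_waypoints(num_waypoints=5))
```

In such a scheme, switching between greedy decoding, sampling, or beam-style variants changes only the inference loop, which is why decoding strategies can be explored without retraining the model.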