Humanoid robots require precise locomotion and dexterous manipulation to perform challenging loco-manipulation tasks. Yet existing approaches, whether modular or end-to-end, fall short in manipulation-aware locomotion. This confines the robot to a limited workspace and prevents it from performing large-space loco-manipulation. We attribute this to: (1) the challenge of acquiring loco-manipulation knowledge due to the scarcity of humanoid teleoperation data, and (2) the difficulty of faithfully and reliably executing locomotion commands, stemming from the limited precision and stability of existing RL controllers. To acquire richer loco-manipulation knowledge, we propose a unified latent learning framework that enables a Vision-Language-Action (VLA) system to learn from low-cost, action-free egocentric videos. We further devise an efficient human data collection pipeline to augment the dataset and scale these benefits. To execute the desired locomotion commands more precisely, we present a loco-manipulation-oriented (LMO) RL policy tailored for accurate and stable core loco-manipulation movements such as advancing, turning, and squatting. Building on these components, we introduce WholeBodyVLA, a unified framework for humanoid loco-manipulation. To the best of our knowledge, WholeBodyVLA is among the first frameworks to enable large-space humanoid loco-manipulation. Comprehensive experiments on the AgiBot X2 humanoid verify the approach, which outperforms the prior baseline by 21.3%. WholeBodyVLA also demonstrates strong generalization and high extensibility across a broad range of tasks.