We present an end-to-end framework for synthesizing semantically meaningful co-speech gestures and deploying them in real time on a humanoid robot. The system addresses the challenge of creating natural, expressive non-verbal communication for robots by integrating advanced gesture generation with robust physical control. Our core contribution lies in the tight integration of a semantics-aware gesture synthesis module, which derives expressive reference motions from speech input by combining a generative retrieval mechanism based on large language models (LLMs) with an autoregressive Motion-GPT model. This is coupled with a high-fidelity imitation-learning control policy, MotionTracker, which enables the Unitree G1 humanoid robot to execute these complex motions dynamically while maintaining balance. To ensure physical feasibility, we employ a robust General Motion Retargeting (GMR) method to bridge the embodiment gap between human motion data and the robot platform. Through comprehensive evaluation, we demonstrate that the combined system produces semantically appropriate and rhythmically coherent gestures that are accurately tracked and executed by the physical robot. To our knowledge, this work represents a significant step toward general real-world use, providing a complete pipeline for automatic, semantics-aware co-speech gesture generation and synchronized real-time physical deployment on a humanoid robot.
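To make the data flow of the described pipeline concrete, the following is a minimal illustrative sketch in Python. All class and method names (CoSpeechGesturePipeline, GestureRetriever-style retriever, motion_gpt, retargeter, tracker) are hypothetical placeholders chosen for exposition; they do not reflect the authors' actual code or any published API.

```python
from typing import Any


class CoSpeechGesturePipeline:
    """Illustrative end-to-end flow: speech -> reference motion -> robot execution.

    The four injected components stand in for the modules described in the
    abstract; their interfaces here are assumptions, not the real implementation.
    """

    def __init__(self, retriever: Any, motion_gpt: Any, retargeter: Any, tracker: Any):
        self.retriever = retriever    # LLM-based generative retrieval of candidate gestures
        self.motion_gpt = motion_gpt  # autoregressive motion synthesis conditioned on speech
        self.retargeter = retargeter  # General Motion Retargeting (human motion -> Unitree G1)
        self.tracker = tracker        # imitation-learned whole-body tracking control policy

    def step(self, speech_text: str, speech_audio: Any) -> Any:
        # 1) Retrieve semantically matching gesture candidates from the transcript.
        candidates = self.retriever.retrieve(speech_text)
        # 2) Synthesize a rhythmically aligned human reference motion.
        human_motion = self.motion_gpt.generate(speech_audio, candidates)
        # 3) Retarget the human motion to the robot's embodiment (proportions, joint limits).
        robot_motion = self.retargeter.retarget(human_motion)
        # 4) Track the retargeted reference in real time while maintaining balance.
        return self.tracker.track(robot_motion)
```

The sketch only emphasizes the ordering of the stages (retrieval, synthesis, retargeting, tracking); the real system additionally handles streaming speech input and synchronization with the robot's control loop.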