A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Despite advances in humanoid control, bridging high-level multimodal perception with whole-body execution remains a significant bottleneck. Existing methods often struggle to translate heterogeneous instructions, such as language, music, and trajectories, into stable, real-time actions. Here we show that UniAct, a two-stage framework that couples a fine-tuned multimodal large language model (MLLM) with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying inputs through a shared discrete codebook built with finite scalar quantization (FSQ), UniAct ensures cross-modal alignment while constraining motions to a physically grounded manifold. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions. We validate UniAct on UniMoCap, our 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios. Our results mark a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control.
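For readers unfamiliar with the quantization step named above, the sketch below illustrates finite scalar quantization in PyTorch: each latent dimension is bounded and rounded to a small fixed number of levels, so the implicit codebook is the product of the per-dimension levels. The function name and level configuration are illustrative assumptions, not UniAct's released implementation.

```python
# Minimal sketch of finite scalar quantization (FSQ), the general technique the
# abstract names; the level choices and API here are assumptions for illustration.
import torch

def fsq_quantize(z: torch.Tensor, levels: list[int]) -> torch.Tensor:
    """Quantize each latent dimension to a small fixed number of levels.

    z:      (..., d) continuous latents, with d == len(levels)
    levels: quantization levels per dimension, e.g. [8, 5, 5, 5]
            (implicit codebook size = 8 * 5 * 5 * 5 = 1000 codes)
    """
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (L - 1) / 2
    # Bound each dimension to roughly (-(L-1)/2, (L-1)/2), then round to a level.
    bounded = torch.tanh(z) * half
    quantized = torch.round(bounded)
    # Straight-through estimator: gradients pass through as if rounding were identity.
    return bounded + (quantized - bounded).detach()

# Example: a batch of two motion-token latents with a 4-dimensional FSQ code.
z = torch.randn(2, 4)
codes = fsq_quantize(z, levels=[8, 5, 5, 5])
```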