SafeHumanoid: VLM-RAG-driven Control of Upper Body Impedance for Humanoid Robot

Yara Mahmoud, Jeffrin Sam, Nguyen Khang, Marcelino Fernando, Issatay Tokmurziyev, Miguel Altamirano Cabrera, Muhammad Haris Khan, Artem Lykov, Dzmitry Tsetserukou

Safe and trustworthy Human-Robot Interaction (HRI) requires robots not only to complete tasks but also to regulate impedance and speed according to scene context and human proximity. We present SafeHumanoid, an egocentric vision pipeline that links Vision-Language Models (VLMs) with Retrieval-Augmented Generation (RAG) to schedule impedance and velocity parameters for a humanoid robot. Egocentric frames are processed by a structured VLM prompt, embedded and matched against a curated database of validated scenarios, and mapped to joint-level impedance commands via inverse kinematics. We evaluate the system on tabletop manipulation tasks with and without human presence, including wiping, object handovers, and liquid pouring. The results show that the pipeline adapts stiffness, damping, and speed profiles in a context-aware manner, maintaining task success while improving safety. Although current inference latency (up to 1.4 s) limits responsiveness in highly dynamic settings, SafeHumanoid demonstrates that semantic grounding of impedance control is a viable path toward safer, standard-compliant humanoid collaboration.

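The retrieval step described in the abstract (embed the VLM's scene description, match it against validated scenarios, and schedule impedance and speed from the match) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Scenario` fields, the three-dimensional toy embeddings, the stiffness, damping, and speed values, and the use of cosine similarity are hypothetical and are not taken from the paper. The control law shown is a generic single-axis Cartesian spring-damper, not the joint-level mapping via inverse kinematics that the authors describe.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One hypothetical entry of a validated-scenario database."""
    name: str
    embedding: tuple   # toy stand-in for a text embedding of the scene
    stiffness: float   # N/m, illustrative value only
    damping: float     # N*s/m, illustrative value only
    max_speed: float   # m/s, illustrative value only


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, database):
    """Return the stored scenario whose embedding is closest to the query."""
    return max(database, key=lambda s: cosine(query, s.embedding))


def impedance_force(scenario, pos_err, vel_err):
    """Generic spring-damper law F = K * e + D * e_dot on a single axis."""
    return scenario.stiffness * pos_err + scenario.damping * vel_err


# Hypothetical database: softer and slower whenever a human is nearby.
DATABASE = [
    Scenario("wiping_no_human", (1.0, 0.0, 0.0), 600.0, 40.0, 0.50),
    Scenario("handover_human_near", (0.0, 1.0, 0.0), 150.0, 25.0, 0.15),
    Scenario("pouring_human_near", (0.0, 0.0, 1.0), 250.0, 30.0, 0.10),
]

# Toy embedding of a VLM scene description that resembles a handover.
query = (0.1, 0.9, 0.2)
match = retrieve(query, DATABASE)
force = impedance_force(match, pos_err=0.02, vel_err=-0.1)
print(match.name, match.max_speed, force)
```

In a sketch like this, the retrieved entry acts as a gain schedule: the controller keeps running at its own rate while the slower perception loop only swaps the active stiffness, damping, and speed limit, which is one plausible way to tolerate perception latency on the order of the 1.4 s reported above.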