Fine-grained robotic manipulation requires grounding natural language in appropriate affordance targets. However, most existing methods driven by foundation models compress rich semantics into oversimplified affordances, which prevents them from exploiting implicit semantic information. To address this limitation, we present ReSemAct, a unified manipulation framework that introduces Semantic Structuring and Affordance Refinement (SSAR), powered by automated synergistic reasoning between Multimodal Large Language Models (MLLMs) and Vision Foundation Models (VFMs). Specifically, the Semantic Structuring module derives a unified semantic affordance description from natural language and RGB observations, organizing affordance regions, implicit functional intent, and coarse affordance anchors into a structured representation for downstream refinement. Building on this specification, the Affordance Refinement strategy instantiates two complementary flows, one specializing geometry and the other position, to yield fine-grained affordance targets. These refined targets are then encoded as real-time joint-space optimization objectives, enabling reactive and robust manipulation in dynamic environments. We conduct extensive simulation and real-world experiments in semantically rich household environments and semantically sparse chemical lab environments. The results demonstrate that ReSemAct performs diverse tasks under zero-shot conditions, showcasing the robustness of SSAR with foundation models in fine-grained manipulation. Code and videos are available at https://github.com/scy-v/ReSemAct and https://resemact.github.io.
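As a rough illustration of the structured representation described above, the following minimal Python sketch groups an affordance region, an implicit functional intent, and a coarse anchor into one record, then passes it through two independent refinement callables standing in for the geometry and position flows. All names here (`SemanticAffordance`, `RefinedTarget`, `refine`, the field names, and the pixel-coordinate convention) are hypothetical and chosen for illustration; they are not taken from the ReSemAct code base, and the lambdas are placeholders rather than real MLLM or VFM calls.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Pixel = Tuple[int, int]  # assumed (u, v) image coordinates


@dataclass(frozen=True)
class SemanticAffordance:
    """Hypothetical structured description; not the paper's actual schema."""
    region: str           # affordance region, e.g. "mug handle"
    intent: str           # implicit functional intent inferred from the instruction
    coarse_anchor: Pixel  # coarse affordance anchor in the RGB observation


@dataclass(frozen=True)
class RefinedTarget:
    """Hypothetical fine-grained target produced by the two flows."""
    geometry: str    # stand-in for a refined geometric description or mask
    position: Pixel  # refined target position


def refine(
    desc: SemanticAffordance,
    geometry_flow: Callable[[SemanticAffordance], str],
    position_flow: Callable[[SemanticAffordance], Pixel],
) -> RefinedTarget:
    # The two flows run independently on the same structured description.
    return RefinedTarget(geometry=geometry_flow(desc), position=position_flow(desc))


desc = SemanticAffordance(
    region="mug handle",
    intent="grasp without blocking the rim",
    coarse_anchor=(320, 240),
)

# Placeholder flows: real ones would query vision or language models.
target = refine(
    desc,
    geometry_flow=lambda d: d.region,
    position_flow=lambda d: (d.coarse_anchor[0] + 4, d.coarse_anchor[1] - 2),
)
```

Keeping the description immutable and the flows as separate callables mirrors the abstract's split between structuring and refinement; how the refined target is turned into a joint-space objective is outside the scope of this sketch.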