Language-conditioned manipulation facilitates human-robot interaction via behavioral cloning (BC), which learns control policies from human demonstrations and serves as a cornerstone of embodied AI. Overcoming compounding errors in sequential action decisions remains a central challenge to improving BC performance. Existing approaches mitigate compounding errors through data augmentation, expressive representations, or temporal abstraction. However, they suffer from physical discontinuities and semantic-physical misalignment, leading to inaccurate action cloning and intermittent execution. In this paper, we present Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a novel BC framework that ensures temporally consistent execution and fine-grained semantic grounding. It generates robust and smooth action execution trajectories through continuous co-learning across vision, language, and proprioceptive inputs (e.g., robot internal states). Meanwhile, we anchor language semantics to visuomotor representations via bidirectional cross-attention to learn contextual information for action generation, thereby overcoming semantic-physical misalignment. Extensive experiments show that CCoL achieves an average 8.0% relative improvement across three simulation suites, with up to a 19.2% relative gain on human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further confirm CCoL's generalization under unseen and noisy object states.
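To make the semantic-grounding step concrete, the following is a minimal sketch of one way the bidirectional cross-attention between language and visuomotor features could be realized. It is not the authors' released implementation; the class name `BidirectionalCrossAttention`, the fusion head, and all dimensions are illustrative assumptions built on standard PyTorch modules.

```python
# Illustrative sketch (assumed implementation, not CCoL's official code):
# language tokens attend over visuomotor tokens and vice versa, and the two
# grounded streams are fused into a context vector for action generation.
import torch
import torch.nn as nn


class BidirectionalCrossAttention(nn.Module):
    """Anchors language semantics to visuomotor representations in both directions."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Language queries attend over visuomotor keys/values.
        self.lang_to_vm = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Visuomotor queries attend over language keys/values.
        self.vm_to_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, lang_tokens: torch.Tensor, vm_tokens: torch.Tensor) -> torch.Tensor:
        # lang_tokens: (B, L, dim) embedded language instruction
        # vm_tokens:   (B, T, dim) fused visual and proprioceptive features
        lang_grounded, _ = self.lang_to_vm(lang_tokens, vm_tokens, vm_tokens)
        vm_grounded, _ = self.vm_to_lang(vm_tokens, lang_tokens, lang_tokens)
        # Pool each grounded stream and concatenate for a downstream action head.
        context = torch.cat([lang_grounded.mean(dim=1), vm_grounded.mean(dim=1)], dim=-1)
        return self.fuse(context)  # (B, dim) contextual representation


if __name__ == "__main__":
    attn = BidirectionalCrossAttention()
    lang = torch.randn(2, 12, 256)  # e.g., tokenized instruction embeddings
    vm = torch.randn(2, 32, 256)    # e.g., image patches plus robot-state tokens
    print(attn(lang, vm).shape)     # torch.Size([2, 256])
```

In this sketch the symmetric attention lets the instruction and the visuomotor stream condition each other, which is one plausible reading of how the paper's semantic-physical alignment could be wired into a BC policy.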