Bridging the gap between natural-language commands and autonomous execution in unstructured environments remains an open challenge in robotics. It requires robots to perceive and reason over the current task scene through multiple modalities and to plan their behavior toward the intended goal. Traditional robotic task-planning approaches often struggle to connect low-level execution with high-level task reasoning and cannot dynamically update their strategy when instructions change during execution, which ultimately limits their versatility and adaptability to new tasks. In this work, we propose a novel language-model-based framework for dynamic robot task planning. Our Vision-Language-Policy (VLP) model, built on a vision-language model fine-tuned on real-world data, interprets semantic instructions, reasons over the current task scene, and generates behavior policies that drive the robot to accomplish the task. Moreover, it dynamically adjusts its strategy when the task changes, enabling flexible adaptation to evolving requirements. Experiments with different robots on a variety of real-world tasks show that the trained model adapts efficiently to novel scenarios and dynamically updates its policy, demonstrating strong planning autonomy and cross-embodiment generalization. Videos: https://robovlp.github.io/
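The abstract describes a closed loop in which a vision-language model maps the current scene and the latest instruction to a behavior policy, and re-plans when the instruction changes mid-execution. The sketch below is only an illustrative reconstruction of such a loop, not the paper's actual interface; every name in it (VLPModel, SkillStep, run_task, and the callback parameters) is a hypothetical placeholder.

```python
# Hypothetical sketch of an instruction-conditioned, re-planning control loop:
# a vision-language model consumes (scene image, instruction) and emits a
# high-level behavior policy as a sequence of primitive skills. If the
# instruction changes during execution, the plan is regenerated from the
# current scene. All names are illustrative, not the paper's API.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SkillStep:
    """One primitive behavior, e.g. SkillStep('pick', 'red_cup')."""
    name: str
    target: str


class VLPModel:
    """Placeholder for a fine-tuned vision-language model that maps
    (scene image, instruction) to a skill sequence."""

    def plan(self, image, instruction: str) -> List[SkillStep]:
        # In practice this would run VLM inference; left unimplemented here.
        raise NotImplementedError


def run_task(model: VLPModel,
             get_image: Callable[[], object],
             get_instruction: Callable[[], str],
             execute: Callable[[SkillStep], None]) -> None:
    """Execute a task, re-planning whenever the instruction changes."""
    instruction = get_instruction()
    plan = model.plan(get_image(), instruction)
    while plan:
        latest = get_instruction()
        if latest != instruction:
            # Instruction changed mid-execution: re-plan from the current scene.
            instruction = latest
            plan = model.plan(get_image(), instruction)
            continue
        step = plan.pop(0)
        execute(step)  # hand the primitive off to the low-level controller
```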