We introduce Wonderful Team, a multi-agent Vision Large Language Model (VLLM) framework for executing high level robotic planning in a zero-shot regime. In our context, zero-shot high-level planning means that for a novel environment, we provide a VLLM with an image of the robot's surroundings and a task description, and the VLLM outputs the sequence of actions necessary for the robot to complete the task. Unlike previous methods for high-level visual planning for robotic manipulation, our method uses VLLMs for the entire planning process, enabling a more tightly integrated loop between perception, control, and planning. As a result, Wonderful Team's performance on a real-world semantic and physical planning tasks often exceeds methods that rely on separate vision systems. For example, we see an average 40% success-rate improvement on VimaBench over prior methods such as NLaP, an average 30% improvement over Trajectory Generators on tasks from the Trajectory Generator paper including drawing and wiping a plate, and an average 70% improvement over Trajectory Generators on a new set of semantic reasoning tasks including environment re-arrangement with implicit linguistic constraints. We hope these results highlight the rapid improvements of VLLMs in the past year, and motivate the community to consider VLLMs as an option for some high-level robotic planning problems in the future.
翻译:暂无翻译