Research on embodied AI has greatly advanced robot manipulation. However, the field still faces significant challenges in benchmark construction, multi-modal perception and decision-making, and physical execution. Previous robot manipulation simulators were primarily designed to enrich the variety of manipulation and object types, while neglecting the balance between physical manipulation complexity and language instruction complexity in multi-modal environments. This paper proposes a new robot manipulation simulator and builds on it a comprehensive, systematic benchmark of progressive reasoning tasks called SeaWave. SeaWave provides a standard test platform for embodied AI agents in a multi-modal environment, evaluating the execution of human natural language instructions across four levels of difficulty. Previous world model-based robot manipulation work has also lacked research on perception and decision-making for complex instructions in multi-modal environments. To this end, we propose a new world model tailored for cross-modal robot manipulation, called DamWorld. Specifically, DamWorld takes as input the current visual scene and the execution actions predicted from natural language instructions, and supervises the world model's output with the next action frame, forcing the model to learn robot manipulation consistent with world knowledge. Compared with renowned baselines (e.g., RT-1), DamWorld improves the manipulation success rate by 5.6% on average across the four levels of progressive reasoning tasks. Notably, on the most challenging level-4 manipulation tasks, DamWorld still improves by 9.0% over prior work.
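To make the next-frame supervision scheme concrete, below is a minimal training-step sketch. The abstract only states that DamWorld consumes the current visual scene plus language-conditioned predicted actions and is supervised by the next action frame, so all module names, tensor shapes, and the MSE objective here are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of next-frame supervision for a world model.
# Assumed: 64x64 RGB frames, 7-D actions, MSE reconstruction loss.
import torch
import torch.nn as nn

class WorldModelSketch(nn.Module):
    def __init__(self, action_dim=7, hidden=256):
        super().__init__()
        # Encode the current RGB frame (B, 3, 64, 64) into a latent vector.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, hidden),
        )
        # Embed the predicted action (e.g., end-effector deltas + gripper).
        self.action_encoder = nn.Linear(action_dim, hidden)
        # Decode the fused latent back into a predicted next frame.
        self.decoder = nn.Sequential(
            nn.Linear(2 * hidden, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, frame, action):
        z = torch.cat(
            [self.frame_encoder(frame), self.action_encoder(action)], dim=-1
        )
        return self.decoder(z)

model = WorldModelSketch()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# One training step: the observed next frame supervises the prediction.
frame_t = torch.randn(8, 3, 64, 64)   # current visual scene
action_t = torch.randn(8, 7)          # language-conditioned predicted action
frame_t1 = torch.randn(8, 3, 64, 64)  # next action frame (supervision target)

pred = model(frame_t, action_t)
loss = nn.functional.mse_loss(pred, frame_t1)
opt.zero_grad()
loss.backward()
opt.step()
```

The key design point this sketch captures is that the supervision signal is the environment's own next observation, so the model is pushed to internalize action-conditioned scene dynamics rather than only imitating action labels.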