Improving the generalization capabilities of general-purpose robotic manipulation agents in the real world has long been a significant challenge. Existing approaches often rely on collecting large-scale robot data, such as the RT-1 dataset, which is costly and time-consuming. However, due to insufficient data diversity, these approaches are typically limited in open-domain scenarios with novel objects and diverse environments. In this paper, we propose a novel paradigm that effectively leverages language-grounded segmentation masks generated by Internet-scale foundation models to address a wide range of pick-and-place robot manipulation tasks. By integrating the mask modality, which incorporates semantic, geometric, and temporal correlation priors derived from vision foundation models, into the end-to-end policy model, our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning, including to new object instances, semantic categories, and unseen backgrounds. We first introduce a series of foundation models to ground natural language demands across multiple tasks. Second, we develop a two-stream 2D policy model based on imitation learning, which takes raw images, object masks, and robot proprioception as input to predict robot actions. Extensive real-world experiments conducted on a Franka Emika robot arm demonstrate the effectiveness of our proposed paradigm. Demos are available on YouTube (https://www.youtube.com/watch?v=MAcUPFBfRIw).
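To make the two-stream design concrete, the following is a minimal sketch, not the authors' released implementation, of a policy that fuses a raw RGB image, a language-grounded object mask, and robot proprioception to predict an action. All module names, input resolutions, and dimensions (e.g., an 8-D proprioceptive state and a 7-D action) are illustrative assumptions; the sketch assumes PyTorch and a behavior-cloning (imitation learning) objective.

import torch
import torch.nn as nn


def conv_stream(in_channels: int) -> nn.Sequential:
    # Small CNN encoder producing a flat 128-D feature vector from a 128x128 input.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (B, 128)
    )


class TwoStreamPolicy(nn.Module):
    # Hypothetical two-stream policy: one stream for the raw image, one for the
    # foundation-model segmentation mask, fused with proprioception in an MLP head.
    def __init__(self, proprio_dim: int = 8, action_dim: int = 7):
        super().__init__()
        self.rgb_stream = conv_stream(in_channels=3)   # raw RGB image
        self.mask_stream = conv_stream(in_channels=1)  # language-grounded object mask
        self.head = nn.Sequential(
            nn.Linear(128 + 128 + proprio_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),  # e.g. end-effector delta pose + gripper command
        )

    def forward(self, rgb, mask, proprio):
        feats = torch.cat(
            [self.rgb_stream(rgb), self.mask_stream(mask), proprio], dim=-1
        )
        return self.head(feats)


# Behavior-cloning style usage on a dummy batch of 4 samples.
policy = TwoStreamPolicy()
rgb = torch.rand(4, 3, 128, 128)
mask = torch.rand(4, 1, 128, 128)
proprio = torch.rand(4, 8)
expert_action = torch.rand(4, 7)
loss = nn.functional.mse_loss(policy(rgb, mask, proprio), expert_action)
loss.backward()

In this sketch the mask enters as a separate input stream rather than being concatenated with the RGB channels, which is one plausible way to realize the "two-stream" description in the abstract; the paper's actual fusion strategy may differ.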