Recent progress on vision-language foundation models has brought significant advances toward building general-purpose robots. By using the pre-trained models to encode the scene and the instruction as inputs for decision making, an instruction-conditioned policy can generalize across different objects and tasks. While this is encouraging, the policy still fails in most cases when given an unseen task or environment. To adapt the policy to unseen tasks and environments, we explore a new paradigm that leverages the pre-trained foundation models with Self-PLAY and Self-Describe (SPLAYD). When deploying the trained policy to a new task or environment, we first let the policy self-play with randomly generated instructions and record the resulting demonstrations. While the execution may be incorrect, we can use the pre-trained foundation models to accurately self-describe (i.e., re-label or classify) the demonstrations. This automatically provides new demonstration-instruction pairs for policy fine-tuning. We evaluate our method on a broad range of experiments, focusing on generalization to unseen objects, unseen tasks, unseen environments, and sim-to-real transfer. We show that SPLAYD improves over the baselines by a large margin in all cases. Our project page is available at https://geyuying.github.io/SPLAYD/
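To make the self-play / self-describe loop concrete, here is a minimal sketch in Python. It is not the paper's implementation; every name (policy, env, vlm_describe, sample_random_instruction, finetune) is a hypothetical placeholder standing in for the corresponding component described in the abstract.

```python
def splayd_adapt(policy, env, vlm_describe, sample_random_instruction,
                 finetune, num_episodes=100):
    """Sketch of SPLAYD adaptation (hypothetical API, for illustration only)."""
    relabeled_data = []

    # 1) Self-play: roll out the instruction-conditioned policy on
    #    randomly generated instructions in the new task/environment.
    for _ in range(num_episodes):
        instruction = sample_random_instruction()
        obs = env.reset()
        trajectory = []
        done = False
        while not done:
            action = policy.act(obs, instruction)
            next_obs, done = env.step(action)
            trajectory.append((obs, action))
            obs = next_obs

        # 2) Self-describe: even if the execution did not follow the sampled
        #    instruction, a pre-trained vision-language foundation model can
        #    re-label the demonstration with what was actually done.
        described_instruction = vlm_describe(trajectory)
        relabeled_data.append((trajectory, described_instruction))

    # 3) Fine-tune the policy on the automatically collected
    #    demonstration-instruction pairs.
    return finetune(policy, relabeled_data)
```

The key design choice this sketch highlights is that no human labeling is needed: the policy generates the demonstrations, and the foundation model supplies the matching instructions.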