Modeling generalized robot control policies remains an ongoing challenge for language-guided robot manipulation. Existing methods often struggle to efficiently utilize cross-dataset resources or rely on resource-intensive vision-language models, limiting both their multi-task performance and their practical applicability. In this study, we propose a novel approach that decouples robot action trajectory encoding from control policy generation by leveraging a latent action trajectory space, enhancing the generalization ability of policy generation on multi-task manipulation. First, we pre-train a task-agnostic auto-encoder that projects an action trajectory spanning several frames, together with the accompanying observations, into a latent action trajectory space, using large-scale datasets collected with multiple embodiments in diverse environments. We then learn a diffusion model over this latent action trajectory space to generate the actions for subsequent steps. Experiments on two widely used benchmarks show that the proposed method outperforms baselines by 7%-29% in average success rate across eight tasks. Our method consistently benefits from pre-training, whereas the baselines do not, and it runs more than twice as fast as the baseline.
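Below is a minimal sketch, not the authors' implementation, of the two-stage pipeline the abstract describes: (1) a task-agnostic auto-encoder that maps a short action trajectory, conditioned on observation features, into a latent vector, and (2) a diffusion-style denoiser trained in that latent space to predict latents for the next action chunk. All module names, dimensions, the timestep embedding, and the noise schedule are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TrajectoryAutoEncoder(nn.Module):
    """Projects an H-step action chunk (plus an observation embedding) into a latent (assumed design)."""

    def __init__(self, action_dim=7, horizon=8, obs_dim=512, latent_dim=64):
        super().__init__()
        self.action_dim = action_dim
        in_dim = action_dim * horizon + obs_dim
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + obs_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim * horizon),
        )

    def encode(self, actions, obs_emb):
        # actions: (B, H, action_dim); obs_emb: (B, obs_dim)
        flat = actions.flatten(1)
        return self.encoder(torch.cat([flat, obs_emb], dim=-1))

    def decode(self, z, obs_emb):
        out = self.decoder(torch.cat([z, obs_emb], dim=-1))
        return out.view(out.shape[0], -1, self.action_dim)


class LatentDenoiser(nn.Module):
    """Predicts the noise added to a latent, conditioned on observation features and timestep."""

    def __init__(self, latent_dim=64, obs_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + obs_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z_noisy, obs_emb, t):
        t_feat = t.float().unsqueeze(-1) / 1000.0  # crude scalar timestep embedding
        return self.net(torch.cat([z_noisy, obs_emb, t_feat], dim=-1))


# Stage 1: reconstruction objective for the task-agnostic auto-encoder (pre-training).
ae = TrajectoryAutoEncoder()
actions = torch.randn(4, 8, 7)    # batch of 8-step, 7-DoF action chunks (placeholder data)
obs_emb = torch.randn(4, 512)     # observation features from some vision backbone (placeholder)
z = ae.encode(actions, obs_emb)
recon_loss = nn.functional.mse_loss(ae.decode(z, obs_emb), actions)

# Stage 2: standard DDPM-style noise-prediction objective in the frozen latent space.
denoiser = LatentDenoiser()
t = torch.randint(0, 1000, (4,))
noise = torch.randn_like(z)
alpha_bar = torch.cos(t.float() / 1000.0 * torch.pi / 2).pow(2).unsqueeze(-1)  # assumed cosine schedule
z_noisy = alpha_bar.sqrt() * z.detach() + (1 - alpha_bar).sqrt() * noise
diff_loss = nn.functional.mse_loss(denoiser(z_noisy, obs_emb, t), noise)
```

At inference, a latent would be sampled by iteratively denoising from Gaussian noise and then decoded back into an action chunk with the auto-encoder's decoder; the details of that sampler are not specified in the abstract.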