Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arise in multi-object, multi-robot, and multi-environment settings. While recent generative models can synthesize useful data for individual tasks, they do not exploit the compositional structure of robotic domains and struggle to generalize to unseen task combinations. We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. We show that, once trained on a limited subset of tasks, our model can zero-shot generate high-quality transitions from which control policies for unseen task combinations can be learned. We then introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance over monolithic and hard-coded compositional baselines, ultimately solving nearly all held-out tasks and demonstrating the emergence of meaningful compositional structure in the learned representations.
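The factorized architecture can be pictured with a minimal sketch, assuming a PyTorch-style implementation; the class name, dimensions, and linear encoders below are illustrative assumptions rather than the paper's actual design. Each semantic component (robot, object, obstacle, objective) is embedded as its own token, and a shared transformer learns their interactions through attention:

```python
# Minimal sketch of a compositional diffusion transformer (assumed names).
import torch
import torch.nn as nn

class CompositionalDiffusionTransformer(nn.Module):
    def __init__(self, dims, d_model=256, n_heads=8, n_layers=6):
        super().__init__()
        # One encoder per semantic component; `dims` maps a component name
        # (e.g. "robot") to its raw feature dimension (assumed interface).
        self.encoders = nn.ModuleDict(
            {name: nn.Linear(dim, d_model) for name, dim in dims.items()}
        )
        # Simplified diffusion-timestep embedding; t has shape (batch, 1).
        self.time_embed = nn.Linear(1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.decoders = nn.ModuleDict(
            {name: nn.Linear(d_model, dim) for name, dim in dims.items()}
        )

    def forward(self, noisy_parts, t):
        # Encode each component into one token and append the timestep token.
        tokens = [self.encoders[k](v) for k, v in noisy_parts.items()]
        tokens.append(self.time_embed(t))
        x = torch.stack(tokens, dim=1)   # (batch, n_components + 1, d_model)
        x = self.backbone(x)             # attention mixes component tokens
        # Predict the per-component denoising target (e.g. the added noise).
        return {
            k: self.decoders[k](x[:, i])
            for i, k in enumerate(noisy_parts.keys())
        }
```

For instance, `CompositionalDiffusionTransformer({"robot": 14, "object": 7, "obstacle": 6, "objective": 3})` would instantiate one encoder/decoder pair per component (dimensions hypothetical); recombining component inputs at sampling time is what such a factorization would exploit to reach unseen task combinations.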
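The iterative self-improvement procedure can likewise be sketched as a simple loop; all helper callables (`train_diffusion`, `generate`, `offline_rl`, `evaluate`) and the success threshold are illustrative placeholders, not the paper's API. The validation gate reflects the idea stated above: synthetic data is accepted into the next training round only after an offline-RL policy trained on it is vetted on the task.

```python
# Sketch of the self-improvement loop (all names are assumed placeholders).
from typing import Callable, List

def self_improvement_loop(
    train_diffusion: Callable,   # fits the generative model on a dataset
    generate: Callable,          # samples synthetic transitions for a task
    offline_rl: Callable,        # trains a policy from a fixed dataset
    evaluate: Callable,          # returns the policy's task success rate
    real_data: List,
    unseen_tasks: List,
    n_rounds: int = 3,
    success_threshold: float = 0.5,  # assumed acceptance criterion
):
    data = list(real_data)
    model = None
    for _ in range(n_rounds):
        model = train_diffusion(data)
        for task in unseen_tasks:
            synthetic = generate(model, task)    # zero-shot synthesis
            policy = offline_rl(synthetic)       # learn only from synthetic data
            # Validation gate: keep synthetic data only if it yields a
            # policy that actually solves the held-out task.
            if evaluate(policy, task) >= success_threshold:
                data.extend(synthetic)
    return model
```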