While large-scale sequence modeling from offline data has led to impressive performance gains in natural language and image generation, directly translating such ideas to robotics has been challenging. One critical reason for this is that uncurated robot demonstration data, i.e., play data, collected from non-expert human demonstrators is often noisy, diverse, and distributionally multi-modal. This makes extracting useful, task-centric behaviors from such data a difficult generative modeling problem. In this work, we present Conditional Behavior Transformers (C-BeT), a method that combines the multi-modal generation ability of Behavior Transformers with future-conditioned goal specification. On a suite of simulated benchmark tasks, we find that C-BeT improves upon prior state-of-the-art work in learning from play data by an average of 45.7%. Further, we demonstrate for the first time that useful task-centric behaviors can be learned on a real-world robot purely from play data, without any task labels or reward information. Robot videos are best viewed on our project website: https://play-to-policy.github.io
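To make the core idea concrete, the sketch below shows one way a future-conditioned (goal-conditioned) transformer policy could be structured: a future observation drawn from the same play trajectory is embedded as an extra token alongside the observation history, and the policy predicts a categorical distribution over discretized action bins, which is what lets it represent multi-modal behavior. This is a minimal illustrative sketch under assumed PyTorch-style interfaces, not the authors' implementation; class and parameter names such as `GoalConditionedPolicy` and `n_action_bins` are hypothetical, and the real BeT action head (k-means binning plus residual offsets) is simplified to a plain discrete head.

```python
# Illustrative sketch only (not the released C-BeT code): a transformer policy
# conditioned on a future observation ("goal") from the same play trajectory.
import torch
import torch.nn as nn


class GoalConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, goal_dim, n_action_bins,
                 d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        # Project observations and the goal observation into a shared token space.
        self.obs_embed = nn.Linear(obs_dim, d_model)
        self.goal_embed = nn.Linear(goal_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.trunk = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Categorical head over discretized action bins: a simplified stand-in
        # for BeT's binning-plus-offset head, enough to model multi-modality.
        self.action_head = nn.Linear(d_model, n_action_bins)

    def forward(self, obs_seq, goal):
        # obs_seq: (B, T, obs_dim); goal: (B, goal_dim), a future observation
        # sampled from the same trajectory during training.
        goal_tok = self.goal_embed(goal).unsqueeze(1)   # (B, 1, d_model)
        obs_tok = self.obs_embed(obs_seq)               # (B, T, d_model)
        tokens = torch.cat([goal_tok, obs_tok], dim=1)  # prepend the goal token
        feats = self.trunk(tokens)[:, 1:]               # per-timestep features
        return self.action_head(feats)                  # (B, T, n_action_bins)


if __name__ == "__main__":
    policy = GoalConditionedPolicy(obs_dim=10, goal_dim=10, n_action_bins=32)
    obs = torch.randn(2, 8, 10)
    goal = torch.randn(2, 10)
    logits = policy(obs, goal)
    # Training would use cross-entropy against binned actions from play data;
    # at test time the goal is an observation of the desired outcome.
    print(logits.shape)  # torch.Size([2, 8, 32])
```

In this sketch, conditioning is done simply by prepending the goal as a token; the key property it illustrates is that no task labels or rewards are needed, since goals are just future frames relabeled from the play data itself.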