While behavior learning has made impressive progress in recent times, it lags behind computer vision and natural language processing due to its inability to leverage large, human-generated datasets. Human behaviors have wide variance and multiple modes, and human demonstrations typically do not come with reward labels. These properties limit the applicability of current methods in Offline RL and Behavioral Cloning to learn from large, pre-collected datasets. In this work, we present Behavior Transformer (BeT), a new technique to model unlabeled demonstration data with multiple modes. BeT retrofits standard transformer architectures with action discretization coupled with a multi-task action correction inspired by offset prediction in object detection. This allows us to leverage the multi-modal modeling ability of modern transformers to predict multi-modal continuous actions. We experimentally evaluate BeT on a variety of robotic manipulation and self-driving behavior datasets. We show that BeT significantly improves over prior state-of-the-art work on solving demonstrated tasks while capturing the major modes present in the pre-collected datasets. Finally, through an extensive ablation study, we analyze the importance of every crucial component in BeT. Videos of behavior generated by BeT are available at https://notmahi.github.io/bet
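To make the discretize-then-correct idea concrete, the sketch below is a simplified illustration (our own, not the authors' released code) of an action head that pairs a categorical distribution over k-means action bins with a continuous per-bin offset; the class and method names (DiscretizedActionHead, fit_bins, sample) are hypothetical, and PyTorch plus scikit-learn are assumed.

```python
# Minimal sketch of a BeT-style "discretize + offset-correct" action head.
# Assumes PyTorch and scikit-learn; names are illustrative, not the authors' API.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans


class DiscretizedActionHead(nn.Module):
    """Predicts a distribution over k action bins plus a continuous offset per bin."""

    def __init__(self, feature_dim: int, action_dim: int, num_bins: int = 16):
        super().__init__()
        self.num_bins = num_bins
        self.action_dim = action_dim
        # Bin centers are filled in by fit_bins() using k-means over demo actions.
        self.register_buffer("bin_centers", torch.zeros(num_bins, action_dim))
        # A single linear layer emits bin logits and one offset vector per bin.
        self.proj = nn.Linear(feature_dim, num_bins * (1 + action_dim))

    @torch.no_grad()
    def fit_bins(self, demo_actions: torch.Tensor) -> None:
        # Cluster the continuous demonstration actions into k discrete bins.
        km = KMeans(n_clusters=self.num_bins, n_init=10).fit(demo_actions.cpu().numpy())
        self.bin_centers.copy_(torch.as_tensor(km.cluster_centers_, dtype=torch.float32))

    def forward(self, features: torch.Tensor):
        out = self.proj(features)
        logits = out[..., : self.num_bins]
        offsets = out[..., self.num_bins:].reshape(
            *features.shape[:-1], self.num_bins, self.action_dim
        )
        return logits, offsets

    def sample(self, features: torch.Tensor) -> torch.Tensor:
        # Sample a bin from the categorical distribution (captures multi-modality),
        # then add that bin's predicted offset to its k-means center.
        logits, offsets = self(features)
        bins = torch.distributions.Categorical(logits=logits).sample()
        centers = self.bin_centers[bins]
        chosen_offsets = torch.gather(
            offsets, -2, bins[..., None, None].expand(*bins.shape, 1, self.action_dim)
        ).squeeze(-2)
        return centers + chosen_offsets
```

In this simplified view, the transformer backbone would supply `features` for each timestep; training would combine a classification loss on the bin logits with a regression loss on the offset of the ground-truth bin, so that sampling a bin at test time recovers one of the demonstrated modes while the offset restores continuous precision.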