A rich representation is key to general robotic manipulation, but existing model architectures require large amounts of data to learn it. Unfortunately, ideal robotic manipulation training data, which comes in the form of expert visuomotor demonstrations for a variety of annotated tasks, is scarce. In this work we propose PLEX, a transformer-based architecture that learns from a small amount of task-agnostic visuomotor trajectories accompanied by a much larger amount of task-conditioned object manipulation videos -- a type of robotics-relevant data available in quantity. The key insight behind PLEX is that trajectories with both observations and actions help induce a latent feature space and train a robot to execute task-agnostic manipulation routines, while a diverse set of video-only demonstrations can efficiently teach the robot how to plan in this feature space for a wide variety of tasks. In contrast to most works on robotic manipulation pretraining, PLEX learns a generalizable sensorimotor multi-task policy, not just an observational representation. We also show that using relative positional encoding in PLEX's transformers further increases its data efficiency when learning from human-collected demonstrations. Experiments showcase PLEX's generalization on the Meta-World-v2 benchmark and establish state-of-the-art performance in challenging Robosuite environments.