Robotics has long been a field riddled with complex system architectures whose modules and connections, whether traditional or learning-based, require significant human expertise and prior knowledge. Inspired by large pre-trained language models, this work introduces a paradigm for pre-training a general-purpose representation that can serve as a starting point for multiple tasks on a given robot. We present the Perception-Action Causal Transformer (PACT), a generative transformer-based architecture that aims to build representations directly from robot data in a self-supervised fashion. Through autoregressive prediction of states and actions over time, our model implicitly encodes the dynamics and behaviors of a particular robot. Our experimental evaluation focuses on the domain of mobile agents, where we show that this robot-specific representation can function as a single starting point for distinct tasks such as safe navigation, localization, and mapping. We evaluate two form factors: a wheeled robot that uses a LiDAR sensor as perception input (MuSHR), and a simulated agent that uses first-person RGB images (Habitat). We show that fine-tuning small task-specific networks on top of the larger pre-trained model yields significantly better performance than training a single model from scratch for all tasks simultaneously, and performance comparable to training a separate large model for each task independently. By sharing a common high-quality representation across tasks, we can lower overall model capacity and speed up the real-time deployment of such systems.
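To make the autoregressive pre-training idea concrete, below is a minimal sketch in PyTorch (not the authors' released code): states and actions are tokenized, interleaved into a single causal sequence, and a transformer is trained to predict each next token. The linear tokenizers, the dimension names (state_dim, action_dim, d_model), and the MSE objective are illustrative assumptions; in PACT the perception tokenizers are modality-specific networks (e.g., for LiDAR or first-person RGB input).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PACTSketch(nn.Module):
    """Causal transformer over interleaved (state, action) tokens (illustrative)."""

    def __init__(self, state_dim, action_dim, d_model=128, n_heads=4,
                 n_layers=4, max_T=64):
        super().__init__()
        # Linear tokenizers stand in for PACT's modality-specific encoders.
        self.state_embed = nn.Linear(state_dim, d_model)
        self.action_embed = nn.Linear(action_dim, d_model)
        self.pos_embed = nn.Parameter(torch.zeros(1, 2 * max_T, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, action_dim)  # a_t from ...s_t
        self.state_head = nn.Linear(d_model, state_dim)    # s_{t+1} from ...a_t

    def forward(self, states, actions):
        # states: (B, T, state_dim), actions: (B, T, action_dim)
        B, T, _ = states.shape
        s_tok = self.state_embed(states)
        a_tok = self.action_embed(actions)
        # Interleave to (s_0, a_0, s_1, a_1, ...): shape (B, 2T, d_model).
        tokens = torch.stack([s_tok, a_tok], dim=2).reshape(B, 2 * T, -1)
        tokens = tokens + self.pos_embed[:, : 2 * T]
        # Causal mask: each token may attend only to itself and past tokens.
        mask = torch.triu(
            torch.full((2 * T, 2 * T), float("-inf"), device=states.device),
            diagonal=1)
        h = self.backbone(tokens, mask=mask)
        pred_a = self.action_head(h[:, 0::2])  # predictions at state positions
        pred_s = self.state_head(h[:, 1::2])   # predictions at action positions
        return pred_s, pred_a

def pretraining_loss(model, states, actions):
    # Self-supervised objective (assumed): regress the action at each step
    # and the next state, both predicted autoregressively.
    pred_s, pred_a = model(states, actions)
    return F.mse_loss(pred_a, actions) + F.mse_loss(pred_s[:, :-1], states[:, 1:])
```

In this reading, the downstream tasks (navigation, localization, mapping) would each attach a small head on top of the backbone outputs h and be fine-tuned from the shared pre-trained representation, which is what allows the per-task networks to stay small.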