Building on recent advances in representation learning, we propose a novel framework for command-following robots that operate on raw sensor inputs. Previous RL-based methods are either difficult to improve continually after deployment or require a large number of new labels during fine-tuning. Motivated by the (self-)supervised contrastive learning literature, we propose a novel representation, named VAR++, that generates an intrinsic reward function for command-following robot tasks by associating images with sound commands. After the robot is deployed in a new domain, the representation can be updated intuitively and data-efficiently by non-experts, and the robot can fulfill sound commands without any hand-crafted reward functions. We demonstrate our approach on a variety of sound types and robotic tasks, including navigation and manipulation from raw sensor inputs. In simulated experiments, we show that our system can continually self-improve in previously unseen scenarios with less newly labeled data, while achieving better performance than previous methods.
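To make the core idea concrete, the sketch below illustrates one plausible reading of the approach: a contrastive image–sound representation whose similarity score can be used as an intrinsic reward. This is a minimal illustration, not the authors' implementation; the encoder architectures, embedding dimension, temperature, and input sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch (not the paper's code): contrastively associate camera frames with
# sound-command spectrograms, then reuse the learned similarity as an intrinsic reward.

class AudioVisualEncoder(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # Image branch: raw RGB frames -> embedding.
        self.image_net = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Sound branch: command spectrogram -> embedding.
        self.sound_net = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, images, spectrograms):
        z_img = F.normalize(self.image_net(images), dim=-1)
        z_snd = F.normalize(self.sound_net(spectrograms), dim=-1)
        return z_img, z_snd


def contrastive_loss(z_img, z_snd, temperature=0.1):
    # InfoNCE-style loss: matched image/sound pairs are positives,
    # all other pairings in the batch act as negatives.
    logits = z_img @ z_snd.t() / temperature
    targets = torch.arange(z_img.size(0), device=z_img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def intrinsic_reward(encoder, image, spectrogram):
    # Reward the agent when its current observation matches the commanded goal,
    # measured as cosine similarity in the shared embedding space.
    with torch.no_grad():
        z_img, z_snd = encoder(image.unsqueeze(0), spectrogram.unsqueeze(0))
        return (z_img * z_snd).sum().item()


if __name__ == "__main__":
    encoder = AudioVisualEncoder()
    images = torch.randn(8, 3, 64, 64)         # batch of camera frames
    spectrograms = torch.randn(8, 1, 64, 64)    # matching sound-command spectrograms
    loss = contrastive_loss(*encoder(images, spectrograms))
    loss.backward()
    print(intrinsic_reward(encoder, images[0], spectrograms[0]))
```

Under this reading, fine-tuning in a new domain only requires non-experts to supply a modest number of new image–sound pairs for the contrastive objective, rather than redesigning a reward function.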