Dropped into an unknown environment, what should an agent do to quickly learn about the environment and how to accomplish diverse tasks within it? We address this question within the goal-conditioned reinforcement learning paradigm, by identifying how the agent should set its goals at training time to maximize exploration. We propose "Planning Exploratory Goals" (PEG), a method that sets goals for each training episode to directly optimize an intrinsic exploration reward. PEG first chooses goal commands such that the agent's goal-conditioned policy, at its current level of training, will end up in states with high exploration potential. It then launches an exploration policy starting at those promising states. To enable this direct optimization, PEG learns world models and adapts sampling-based planning algorithms to "plan goal commands". In challenging simulated robotics environments including a multi-legged ant robot in a maze, and a robot arm on a cluttered tabletop, PEG exploration enables more efficient and effective training of goal-conditioned policies relative to baselines and ablations. Our ant successfully navigates a long maze, and the robot arm successfully builds a stack of three blocks upon command. Website: https://penn-pal-lab.github.io/peg/
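The goal-planning step described above can be sketched with a cross-entropy-method (CEM) style search: sample candidate goal commands, imagine where the current goal-conditioned policy would end up under the learned world model, and refit the sampling distribution around the goals whose predicted end states score highest on an exploration value. This is a minimal illustrative sketch, not the paper's implementation; the linear `world_model`, proportional `policy`, and `explore_value` stand-ins below are toy assumptions replacing the learned components.

```python
import numpy as np

def plan_exploratory_goal(world_model, policy, explore_value, goal_dim,
                          n_samples=64, n_elites=8, n_iters=5, horizon=10, seed=0):
    """CEM-style search over goal commands: score each candidate goal by the
    exploration value of the state the goal-conditioned policy is predicted
    to reach under the (learned) world model."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(goal_dim), np.ones(goal_dim)
    for _ in range(n_iters):
        goals = rng.normal(mean, std, size=(n_samples, goal_dim))
        scores = np.empty(n_samples)
        for i, g in enumerate(goals):
            s = np.zeros(goal_dim)               # imagined rollout from a fixed start
            for _ in range(horizon):
                s = world_model(s, policy(s, g))  # policy chases goal g in imagination
            scores[i] = explore_value(s)          # "exploration potential" of end state
        elites = goals[np.argsort(scores)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean

# Toy stand-ins (hypothetical, NOT the paper's learned models):
world_model = lambda s, a: s + 0.1 * a            # trivial linear dynamics
policy = lambda s, g: np.clip(g - s, -1.0, 1.0)   # move toward the commanded goal
explore_value = lambda s: s[0]                    # pretend novelty grows along +x

goal = plan_exploratory_goal(world_model, policy, explore_value, goal_dim=2)
```

Under these toy assumptions the planner pushes the commanded goal toward the high-novelty region (large `x`), mirroring how PEG selects goals whose reachable end states have high exploration potential before handing off to an exploration policy.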