Although Deep Reinforcement Learning (DRL) has become popular in many disciplines, including robotics, state-of-the-art DRL algorithms still struggle to learn long-horizon, multi-step, sparse-reward tasks, such as stacking several blocks given only a task-completion reward signal. To improve learning efficiency for such tasks, this paper proposes a DRL exploration technique, termed A^2, which integrates two components inspired by human experience: Abstract demonstrations and Adaptive exploration. A^2 first decomposes a complex task into subtasks and then provides the correct order in which the subtasks should be learned. During training, the agent explores the environment adaptively, acting more deterministically for well-mastered subtasks and more stochastically for ill-learnt subtasks. Ablation and comparative experiments are conducted on several grid-world tasks and three robotic manipulation tasks. We demonstrate that A^2 can help popular DRL algorithms (DQN, DDPG, and SAC) learn more efficiently and stably in these environments.
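Below is a minimal sketch of the adaptive-exploration idea summarized above: the agent tracks a running success rate per subtask and explores less (more deterministically) on subtasks it has mastered and more (more stochastically) on subtasks it has not. The class, parameter names, and the epsilon-greedy formulation are illustrative assumptions, not the paper's exact implementation.

```python
import random
from collections import defaultdict


class AdaptiveExploration:
    """Sketch of per-subtask adaptive exploration (hypothetical names/values)."""

    def __init__(self, eps_min=0.05, eps_max=1.0, momentum=0.99):
        self.eps_min = eps_min            # exploration floor for well-mastered subtasks
        self.eps_max = eps_max            # exploration ceiling for ill-learnt subtasks
        self.momentum = momentum          # smoothing factor for the running success rate
        self.success_rate = defaultdict(float)  # per-subtask mastery estimate in [0, 1]

    def update(self, subtask_id, succeeded):
        """Update the running success estimate for one subtask after an attempt."""
        r = self.success_rate[subtask_id]
        self.success_rate[subtask_id] = self.momentum * r + (1 - self.momentum) * float(succeeded)

    def epsilon(self, subtask_id):
        """Exploration probability: high when mastery is low, low when mastery is high."""
        mastery = self.success_rate[subtask_id]
        return self.eps_max - (self.eps_max - self.eps_min) * mastery

    def select_action(self, subtask_id, greedy_action, action_space):
        """Epsilon-greedy action selection with a subtask-dependent epsilon."""
        if random.random() < self.epsilon(subtask_id):
            return random.choice(action_space)   # explore on ill-learnt subtasks
        return greedy_action                     # exploit on well-mastered subtasks
```

In this sketch, a mastered subtask (success rate near 1) drives epsilon toward `eps_min`, so the agent acts almost deterministically there, while a poorly learned subtask keeps epsilon near `eps_max` and retains heavy exploration.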