It has been a recent trend to leverage the power of supervised learning (SL) to build more effective reinforcement learning (RL) methods. We propose a novel phasic approach that alternates online RL and offline SL to tackle sparse-reward goal-conditioned problems. In the online phase, we perform RL training and collect rollout data; in the offline phase, we perform SL on the successful trajectories from the collected dataset. To further improve sample efficiency, we adopt additional techniques in the online phase, including task reduction to generate more feasible trajectories and a value-difference-based intrinsic reward to alleviate the sparse-reward issue. We call the overall algorithm PhAsic self-Imitative Reduction (PAIR). PAIR substantially outperforms both non-phasic RL and phasic SL baselines on sparse-reward goal-conditioned robotic control problems, including a challenging stacking task. PAIR is the first RL method that learns to stack 6 cubes from scratch with only 0/1 success rewards.
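To make the phasic structure concrete, the following is a minimal sketch of the alternation between an online RL phase and an offline self-imitation (SL) phase on successful trajectories. The toy environment, the `Policy` interface, and all names here are illustrative assumptions rather than the authors' implementation, and task reduction and the value-difference intrinsic reward are omitted for brevity.

```python
import random

class ToyGoalEnv:
    """Toy 1-D goal-reaching task with a 0/1 success signal at episode end."""
    def __init__(self, size=10, horizon=20):
        self.size, self.horizon = size, horizon

    def rollout(self, policy):
        pos, goal, traj = 0, random.randrange(self.size), []
        for _ in range(self.horizon):
            action = policy.act(pos, goal)          # action in {-1, 0, +1}
            traj.append((pos, goal, action))
            pos = max(0, min(self.size - 1, pos + action))
            if pos == goal:
                return traj, True                   # sparse success reward = 1
        return traj, False                          # failure, reward = 0

class Policy:
    """Placeholder policy exposing an RL-style update and an SL (imitation) update."""
    def act(self, pos, goal):
        return random.choice([-1, 0, 1])
    def rl_update(self, traj, success):
        pass  # online RL update (e.g. policy gradient, possibly with intrinsic reward)
    def sl_update(self, traj):
        pass  # offline supervised update (behavior cloning on successful data)

def pair(policy, env, num_phases=10, episodes_per_phase=50):
    """Alternate an online RL phase with an offline SL phase on successes."""
    success_buffer = []
    for _ in range(num_phases):
        # Online phase: collect rollouts and perform RL training.
        for _ in range(episodes_per_phase):
            traj, success = env.rollout(policy)
            policy.rl_update(traj, success)
            if success:
                success_buffer.append(traj)
        # Offline phase: supervised learning on the successful trajectories.
        for traj in success_buffer:
            policy.sl_update(traj)

pair(Policy(), ToyGoalEnv())
```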