Current reinforcement learning algorithms train the agent on forward-generated trajectories. These trajectories give the agent little guidance, so the agent is free to explore as much as possible. While the appeal of reinforcement learning comes from sufficient exploration, this comes at the cost of sample efficiency, and sample efficiency is an important factor in the performance of an algorithm. Prior work improves sample efficiency with reward shaping techniques or by changing the network architecture, but these methods require many additional steps to implement. In this work, we propose a novel reverse curriculum reinforcement learning method. Reverse curriculum learning trains the agent on the backward trajectory of an episode rather than the original forward trajectory. This gives the agent a strong reward signal, so the agent can learn in a more sample-efficient manner. Moreover, our method requires only a minor change to the algorithm: reversing the order of the trajectory before training the agent. It can therefore be applied easily to any state-of-the-art algorithm.
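The following is a minimal sketch of the reversal step, not the authors' implementation: it assumes an episodic setting where a trajectory is collected as a list of (state, action, reward, next_state, done) tuples, and the names `agent.update`, `collect_episode`, and `train_on_reversed_trajectory` are hypothetical placeholders for whatever base algorithm is used.

```python
def train_on_reversed_trajectory(agent, trajectory):
    """Feed an episode to the agent in reverse time order.

    `trajectory` is assumed to be a list of
    (state, action, reward, next_state, done) tuples collected by a
    normal forward rollout; `agent.update` is a placeholder for the
    per-transition update of any underlying RL algorithm.
    """
    # The only algorithmic change: iterate from the terminal transition
    # back to the first, so updates start near the strong reward signal.
    for transition in reversed(trajectory):
        agent.update(transition)


# Hypothetical usage: collect a forward rollout as usual, then reverse
# its order before training.
# trajectory = collect_episode(env, agent)        # forward rollout
# train_on_reversed_trajectory(agent, trajectory) # backward training pass
```

Because the change is confined to the order in which collected transitions are consumed, the rollout collection, network, and loss of the base algorithm are left untouched.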