In reinforcement learning (RL), it is challenging for an agent to master a task that requires a specific series of actions when rewards are sparse. To address this problem, reverse curriculum generation (RCG) provides a reverse expansion approach that automatically generates a curriculum for the agent to learn. More specifically, RCG gradually expands the initial state distribution outward from the neighborhood of the goal as training proceeds. However, the initial state distribution generated at each iteration might be biased, causing the policy to overfit or slowing down the reverse expansion. When RCG is trained with actor-critic (AC) based RL algorithms, this poor generalization and slow convergence can be induced by the tight coupling within an AC pair. We therefore propose a parallelized approach that simultaneously trains multiple AC pairs and periodically exchanges their critics. We empirically demonstrate that the proposed approach improves RCG in both performance and convergence, and that it can also be applied to other AC-based RL algorithms with adapted initial state distributions.
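To make the proposed mechanism concrete, the following is a minimal sketch of training several AC pairs in parallel and periodically shuffling their critics. All class and function names (ACPair, exchange_critics, train_parallel_rcg, sample_start_states, exchange_every) are illustrative assumptions, not the paper's actual implementation, and the per-pair update is left as a placeholder.

```python
import random

class ACPair:
    """Hypothetical container for one actor-critic pair."""
    def __init__(self, actor, critic):
        self.actor = actor
        self.critic = critic

    def train_step(self, start_states):
        """Placeholder: one actor and critic update using episodes started
        from the given (reverse-curriculum) initial state distribution."""
        pass

def exchange_critics(pairs, rng=random):
    """Randomly permute the critics among the AC pairs."""
    critics = [p.critic for p in pairs]
    rng.shuffle(critics)
    for pair, critic in zip(pairs, critics):
        pair.critic = critic

def train_parallel_rcg(pairs, sample_start_states, n_iters, exchange_every):
    """Train all pairs on the same adapted initial state distribution and
    exchange critics every `exchange_every` iterations (illustrative only)."""
    for it in range(n_iters):
        start_states = sample_start_states(it)  # reverse-expanded start states
        for pair in pairs:
            pair.train_step(start_states)
        if (it + 1) % exchange_every == 0:
            exchange_critics(pairs)
```

The intent of the periodic exchange is to loosen the tight actor-critic coupling described above: each actor is intermittently evaluated by a critic trained alongside a different actor, which may reduce overfitting to a biased initial state distribution.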