Reinforcement Learning (RL) has achieved remarkable results, but it still suffers from inadequate exploration strategies, sparse reward signals, and deceptive reward functions. To alleviate these problems, a Population-guided Novelty Search (PNS) parallel learning method is proposed in this paper. In PNS, the population is divided into multiple sub-populations, each of which has one chief agent and several exploring agents. The chief agent evaluates the policies learned by the exploring agents and shares the optimal policy with all sub-populations. The exploring agents learn their policies collaboratively, guided by the optimal policy, and simultaneously upload their policies to the chief agent. To balance exploration and exploitation, Novelty Search (NS) is employed in every chief agent to encourage policies with high novelty while maximizing per-episode performance. We apply PNS to the Twin Delayed Deep Deterministic policy gradient (TD3) algorithm. The effectiveness of PNS in promoting exploration and improving performance in continuous control domains is demonstrated in the experimental section. Notably, PNS-TD3 achieves rewards that far exceed those of state-of-the-art (SOTA) methods in environments with sparse or delayed reward signals. We also demonstrate that PNS enables robotic agents to learn control policies directly from pixels for sparse-reward manipulation tasks in both simulated and real-world settings.
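To make the chief agent's role more concrete, the following is a minimal sketch of how a chief might combine per-episode return with a k-nearest-neighbour novelty score (the standard Novelty Search formulation) when choosing which uploaded policy to broadcast to the sub-populations. The behaviour descriptors, the weighting factor w, and the function names are illustrative assumptions for this sketch, not the exact formulation used in the paper.

```python
import numpy as np

def novelty_score(behavior, archive, k=10):
    """Average Euclidean distance from `behavior` to its k nearest
    neighbours in an archive of previously observed behaviour descriptors."""
    if len(archive) == 0:
        return 0.0
    dists = np.linalg.norm(np.asarray(archive) - behavior, axis=1)
    k = min(k, len(dists))
    return float(np.sort(dists)[:k].mean())

def chief_select(candidates, archive, w=0.5, k=10):
    """Score each uploaded (policy, behaviour, return) triple by a weighted
    sum of episode return and novelty; return the best policy to broadcast.
    The linear weighting `w` is a hypothetical choice for illustration."""
    best_policy, best_score = None, -np.inf
    for policy, behavior, episode_return in candidates:
        score = (1 - w) * episode_return + w * novelty_score(behavior, archive, k)
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy

# Hypothetical usage: three exploring agents upload their policies together
# with a 2-D behaviour descriptor and the return of their latest episode.
archive = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
candidates = [
    ("pi_1", np.array([0.1, 0.2]), 120.0),
    ("pi_2", np.array([3.0, 2.5]), 100.0),
    ("pi_3", np.array([0.9, 1.1]), 130.0),
]
print(chief_select(candidates, archive))  # policy with the best combined score
```

In this sketch, rewarding distance to the archive pushes the shared optimal policy toward behaviours the population has not yet visited, while the return term preserves exploitation of high-performing policies.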