Reinforcement Learning (RL) has achieved remarkable results, but it still suffers from inadequate exploration strategies, sparse reward signals, and deceptive reward functions. These problems motivate the need for more efficient and directed exploration. To address them, a Population-guided Novelty Search (PNS) parallel learning method is proposed. In PNS, the population is divided into multiple sub-populations, each of which has one chief agent and several exploring agents. The role of the chief agent is to evaluate the policies learned by the exploring agents and to share the optimal policy with all sub-populations. The role of the exploring agents is to learn their policies under the guidance of the optimal policy and, simultaneously, to upload their policies to the chief agent. To balance exploration and exploitation, Novelty Search (NS) is employed in the chief agents to encourage policies with high novelty while maximizing per-episode performance. The introduction of sub-populations and the NS mechanism promotes directed exploration and enables better policy search. In the numerical experiment section, the proposed scheme is applied to the twin delayed deep deterministic policy gradient (TD3) algorithm, and the effectiveness of PNS in promoting exploration and improving performance is demonstrated in both continuous and discrete control domains. Notably, the proposed method achieves rewards that far exceed those of state-of-the-art (SOTA) methods in delayed MuJoCo environments.
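As a rough illustration of the population structure described above, the sketch below implements a toy variant of the PNS loop: each sub-population holds several exploring agents that perturb their policies under the guidance of the globally shared optimal policy, and one chief agent that scores candidates by episode return plus a novelty bonus and forwards its best policy for sharing. This is a minimal sketch under stated assumptions, not the authors' implementation; the names `episode_return`, `behavior`, `novelty`, and the weights `beta` and `w_nov` are illustrative, and the paper's exploring agents are TD3 learners rather than the random hill climbers used here.

```python
import numpy as np

rng = np.random.default_rng(0)

def episode_return(theta):
    """Toy per-episode return with a deceptive local optimum (stand-in for the RL task)."""
    return -np.sum((theta - 3.0) ** 2) + 2.0 * np.exp(-np.sum(theta ** 2))

def behavior(theta):
    """Behavior characterization; here simply the policy parameters themselves (an assumption)."""
    return theta.copy()

def novelty(b, archive, k=5):
    """Novelty score: mean distance to the k nearest behaviors in the archive."""
    if not archive:
        return 0.0
    d = np.sort([np.linalg.norm(b - a) for a in archive])
    return float(np.mean(d[:k]))

def run_pns(n_subpops=3, n_explorers=4, dim=2, iters=200, beta=0.1, w_nov=0.5):
    # Each sub-population: several exploring agents plus one chief agent that keeps
    # a novelty archive and shares its best policy across all sub-populations.
    explorers = rng.normal(size=(n_subpops, n_explorers, dim))
    archives = [[] for _ in range(n_subpops)]
    best_theta = explorers[0, 0].copy()
    best_score = episode_return(best_theta)

    for _ in range(iters):
        for s in range(n_subpops):
            for e in range(n_explorers):
                theta = explorers[s, e]
                # Exploring agent: random perturbation plus guidance toward the shared optimal policy.
                candidate = theta + 0.1 * rng.normal(size=dim) + beta * (best_theta - theta)
                if episode_return(candidate) >= episode_return(theta):
                    explorers[s, e] = candidate
            # Chief agent: evaluate exploring agents by return plus a novelty bonus.
            scored = []
            for e in range(n_explorers):
                b = behavior(explorers[s, e])
                scored.append((episode_return(explorers[s, e]) + w_nov * novelty(b, archives[s]),
                               explorers[s, e]))
                archives[s].append(b)
            local_best = max(scored, key=lambda t: t[0])[1]
            # Share the best policy found so far with all sub-populations.
            if episode_return(local_best) > best_score:
                best_score, best_theta = episode_return(local_best), local_best.copy()
    return best_theta, best_score

if __name__ == "__main__":
    theta, score = run_pns()
    print("best return:", round(score, 3))
```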