We propose an exploration mechanism for policies in Deep Reinforcement Learning, called Add Noise to Noise (AN2N), which makes the agent explore more when it needs to. The core idea is: when the agent is in a state in which it has historically performed poorly, it should explore more. We therefore use cumulative rewards to evaluate in which past states the agent has not performed well, and cosine distance to measure whether the current state calls for additional exploration. This exploration mechanism over the agent's policy proves conducive to efficient exploration. We combine the proposed AN2N mechanism with the Deep Deterministic Policy Gradient (DDPG) and Soft Actor-Critic (SAC) algorithms and apply it to continuous control tasks such as HalfCheetah, Hopper, and Swimmer, achieving considerable improvements in performance and convergence speed.
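To make the description concrete, the sketch below illustrates one way the mechanism outlined above could be realized: cumulative episode returns mark states from poorly performing episodes, and the cosine distance between the current state and those states decides whether an extra noise term is added on top of the usual exploration noise. This is a minimal sketch under assumed design choices; the class and parameter names (AN2NNoise, base_sigma, extra_sigma, dist_threshold, return_threshold) are illustrative and not taken from the paper.

import numpy as np

# Illustrative sketch of the AN2N idea; thresholds and noise scales are
# hypothetical choices, not values from the paper.
class AN2NNoise:
    def __init__(self, base_sigma=0.1, extra_sigma=0.2, dist_threshold=0.3):
        self.base_sigma = base_sigma      # standard exploration noise (e.g. DDPG-style Gaussian)
        self.extra_sigma = extra_sigma    # additional "noise on noise" for hard states
        self.dist_threshold = dist_threshold
        self.poor_states = []             # states collected from low-return episodes

    def record_episode(self, states, episode_return, return_threshold):
        # Use the cumulative reward to decide whether this episode's states
        # count as "poorly performing" and should trigger extra exploration later.
        if episode_return < return_threshold:
            self.poor_states.extend(states)

    def _cosine_distance(self, a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    def sample(self, state, action_dim):
        # Base exploration noise added to the deterministic action.
        noise = np.random.normal(0.0, self.base_sigma, size=action_dim)
        # If the current state is close (in cosine distance) to a state where the
        # agent performed poorly, add a second noise term to explore more.
        for s in self.poor_states:
            if self._cosine_distance(state, s) < self.dist_threshold:
                noise += np.random.normal(0.0, self.extra_sigma, size=action_dim)
                break
        return noise

In use, the noise returned by sample would simply be added to the action produced by the deterministic policy (as in DDPG), so states resembling past failures receive a larger exploration perturbation.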