Deep reinforcement learning (RL) methods generally rely on noise injected into the action space for exploratory behavior. An alternative is to add noise directly to the agent's parameters, which can lead to more consistent exploration and a richer set of behaviors. Methods such as evolutionary strategies use parameter perturbations, but discard all temporal structure in the process and require significantly more samples. Combining parameter noise with traditional RL methods yields the best of both worlds. We demonstrate that both off- and on-policy methods benefit from this approach through an experimental comparison of DQN, DDPG, and TRPO on high-dimensional discrete-action environments as well as continuous control tasks. Our results show that RL with parameter noise learns more efficiently than both traditional RL with action-space noise and evolutionary strategies.
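To illustrate the distinction between the two exploration schemes, the following is a minimal sketch (not the paper's implementation): it contrasts action-space noise, which perturbs each chosen action independently, with parameter-space noise, which perturbs the policy parameters once and then acts deterministically with the perturbed policy. The linear policy, the variable names, and the noise scale `sigma` are illustrative assumptions.

```python
# Minimal illustrative sketch, not the paper's actual algorithm.
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, sigma = 4, 2, 0.1

theta = rng.standard_normal((act_dim, obs_dim))   # policy parameters (assumed linear policy)

def policy(obs, params):
    # Deterministic linear policy: action = params @ observation
    return params @ obs

obs = rng.standard_normal(obs_dim)

# Action-space noise: perturb the chosen action independently at every step.
action_noisy = policy(obs, theta) + sigma * rng.standard_normal(act_dim)

# Parameter-space noise: perturb the parameters once (e.g. at the start of an
# episode), then act deterministically with the perturbed policy. Exploration
# becomes state-dependent and temporally consistent across the episode.
theta_perturbed = theta + sigma * rng.standard_normal(theta.shape)
action_param_noise = policy(obs, theta_perturbed)
```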