Deep reinforcement learning has achieved considerable success with the advent of trust region policy optimization (TRPO) and proximal policy optimization (PPO), owing to their scalability and efficiency. However, the pessimism of both algorithms, in that one constrains updates to a trust region while the other strictly excludes all suspicious gradients, has been shown to suppress exploration and harm the agent's performance. To address these issues, we propose a shifted Markov decision process (MDP), that is, an MDP with entropy augmentation, to encourage exploration and strengthen the agent's ability to escape from suboptima. Our method is extensible and adapts to either reward shaping or bootstrapping. Through convergence analysis, we find that controlling the temperature coefficient is crucial; when it is tuned appropriately, the method, being simple yet effective, achieves remarkable performance even when applied to other algorithms. Our experiments evaluate the augmented TRPO and PPO on MuJoCo benchmark tasks and indicate that the agent is encouraged toward higher-reward regions while maintaining a balance between exploration and exploitation. We further verify the exploration bonus of our method on two grid-world environments.
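A minimal sketch of the entropy-augmented (shifted) reward described above, assuming the shift adds a policy-entropy bonus scaled by a temperature coefficient \(\alpha\) (the symbols and the exact form are chosen here for illustration and may differ from the paper's formulation):

\[
\tilde{r}(s_t, a_t) = r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big),
\qquad
\mathcal{H}\big(\pi(\cdot \mid s_t)\big) = -\sum_{a} \pi(a \mid s_t) \log \pi(a \mid s_t).
\]

Under the bootstrapping adaptation mentioned above, a comparable entropy bonus would enter the value target rather than the immediate reward.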