Maximum entropy (MaxEnt) RL maximizes a combination of the original task reward and an entropy reward. The regularization that entropy imposes on both policy improvement and policy evaluation is believed to contribute to good exploration, training convergence, and robustness of the learned policies. This paper takes a closer look at entropy as an intrinsic reward by conducting various ablation studies on soft actor-critic (SAC), a popular representative of MaxEnt RL. Our findings reveal that, in general, the entropy reward should be applied to policy evaluation with caution. On one hand, the entropy reward, like any other intrinsic reward, could obscure the main task reward if it is not properly managed. We identify failure cases of the entropy reward, especially in episodic Markov decision processes (MDPs), where it can cause the policy to be overly optimistic or pessimistic. On the other hand, our large-scale empirical study shows that using entropy regularization only in policy improvement leads to comparable or even better performance and robustness than using it in both policy improvement and policy evaluation. Based on these observations, we recommend either normalizing the entropy reward to zero mean (SACZero) or simply removing it from policy evaluation (SACLite) for better practical results.
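To make the distinction concrete, the following is a minimal sketch, in NumPy, of how the entropy term enters the SAC critic (policy evaluation) target under the three variants discussed above. It is not the authors' implementation: all names (critic_target, logp_next, ent_reward_mean, ...) are hypothetical, and the running mean assumed by the SACZero branch would have to be maintained elsewhere during training.

```python
# Sketch of the SAC critic target under three variants (hypothetical names).
# In every variant, the actor (policy improvement) loss keeps the
# alpha * log pi(a|s) entropy term unchanged; only the critic target differs.
import numpy as np

def critic_target(reward, done, q1_targ, q2_targ, logp_next,
                  gamma=0.99, alpha=0.2, variant="sac", ent_reward_mean=0.0):
    """TD target y for the Q-function update, for a batch of transitions.

    variant:
      "sac"     - vanilla SAC: entropy reward kept in policy evaluation,
                  y = r + gamma * (min Q'(s', a') - alpha * log pi(a'|s'))
      "saclite" - entropy reward removed from policy evaluation,
                  y = r + gamma * min Q'(s', a')
      "saczero" - entropy reward shifted to zero mean,
                  y = r + gamma * (min Q'(s', a')
                                   + (-alpha * log pi(a'|s') - ent_reward_mean))

    ent_reward_mean is assumed to be a running mean of the entropy reward
    -alpha * log pi, tracked outside this function.
    """
    q_next = np.minimum(q1_targ, q2_targ)   # clipped double-Q estimate
    ent_reward = -alpha * logp_next          # entropy intrinsic reward
    if variant == "sac":
        q_next = q_next + ent_reward
    elif variant == "saczero":
        q_next = q_next + (ent_reward - ent_reward_mean)
    elif variant != "saclite":
        raise ValueError(f"unknown variant: {variant}")
    return reward + gamma * (1.0 - done) * q_next
```

Under this reading, SACLite simply drops the entropy bonus from the bootstrapped target while keeping it in the actor update, and SACZero keeps the bonus but centers it so that, on average, it neither inflates nor deflates the value estimates relative to the task reward.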