Your Policy Regularizer is Secretly an Adversary

Policy regularization methods such as maximum entropy regularization are widely used in reinforcement learning to improve the robustness of a learned policy. In this paper, we show how this robustness arises from hedging against worst-case perturbations of the reward function, which are chosen from a limited set by an imagined adversary. Using convex duality, we characterize this robust set of adversarial reward perturbations under KL and alpha-divergence regularization, which includes Shannon and Tsallis entropy regularization as special cases. Importantly, generalization guarantees can be given within this robust set. We provide detailed discussion of the worst-case reward perturbations, and present intuitive empirical examples to illustrate this robustness and its relationship with generalization. Finally, we discuss how our analysis complements and extends previous results on adversarial reward robustness and path consistency optimality conditions.
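To make the adversarial reading concrete, here is a minimal sketch for the simplest case only: a single-step (bandit) problem with KL regularization toward a reference policy. It is not the paper's code. The function names, the reward vector, the uniform reference policy, and the regularization strength `beta` are all illustrative assumptions. Under these assumptions, the regularized optimum is proportional to `prior * exp(beta * r)`, the adversary's perturbation for that policy is `(1/beta) * log(pi / prior)`, and subtracting it from the reward leaves every action with the same perturbed reward, equal to the soft (log-sum-exp) value.

```python
import math


def kl_regularized_policy(rewards, prior, beta):
    """Maximizer of <pi, r> - (1/beta) * KL(pi || prior) over the simplex."""
    weights = [q * math.exp(beta * r) for q, r in zip(prior, rewards)]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


def worst_case_perturbation(policy, prior, beta):
    """Reward perturbation an adversary would pick against `policy` (KL case)."""
    return [math.log(p / q) / beta for p, q in zip(policy, prior)]


# Hypothetical three-action problem (values chosen only for illustration).
rewards = [1.0, 0.5, -0.2]
prior = [1.0 / 3.0] * 3
beta = 2.0

policy = kl_regularized_policy(rewards, prior, beta)
delta_r = worst_case_perturbation(policy, prior, beta)

# Reward after the adversary's perturbation: identical for every action.
perturbed = [r - d for r, d in zip(rewards, delta_r)]

# Soft value (1/beta) * log sum_a prior(a) * exp(beta * r(a)).
soft_value = math.log(
    sum(q * math.exp(beta * r) for q, r in zip(prior, rewards))
) / beta

# Regularized objective evaluated at the optimal policy.
kl = sum(p * math.log(p / q) for p, q in zip(policy, prior))
objective = sum(p * r for p, r in zip(policy, rewards)) - kl / beta

# The perturbation sits on the boundary of the adversary's feasible set:
# sum_a prior(a) * exp(beta * delta_r(a)) = 1.
feasibility = sum(q * math.exp(beta * d) for q, d in zip(prior, delta_r))

print("policy:", policy)
print("perturbed rewards:", perturbed)
print("soft value:", soft_value, "objective:", objective)
```

The equal perturbed rewards show the hedging effect in this toy case: once the adversary has acted, the agent is indifferent between actions, so no perturbation inside the feasible set can push its return below the soft value.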