Policy regularization methods such as maximum entropy regularization are widely used in reinforcement learning to improve the robustness of a learned policy. In this paper, we show how this robustness arises from hedging against worst-case perturbations of the reward function, which are chosen from a limited set by an imagined adversary. Using convex duality, we characterize this robust set of adversarial reward perturbations under KL- and $\alpha$-divergence regularization, which includes Shannon and Tsallis entropy regularization as special cases. Importantly, generalization guarantees can be given within this robust set. We provide a detailed discussion of the worst-case reward perturbations, and present intuitive empirical examples to illustrate this robustness and its relationship with generalization. Finally, we discuss how our analysis complements and extends previous results on adversarial reward robustness and path consistency optimality conditions.
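To give a flavor of the duality behind these statements, consider a minimal single-step (bandit) sketch of the KL-regularized case; the temperature $\beta$, reference policy $\pi_0$, reward $r(a)$, and simplex $\Delta$ below are notational assumptions for this illustration, and the paper's full treatment covers the sequential and $\alpha$-divergence settings. By convex conjugacy,
\[
\max_{\pi \in \Delta} \; \mathbb{E}_{a \sim \pi}\big[r(a)\big] - \tfrac{1}{\beta}\, D_{\mathrm{KL}}\big(\pi \,\|\, \pi_0\big)
\;=\; \tfrac{1}{\beta} \log \mathbb{E}_{a \sim \pi_0}\big[e^{\beta r(a)}\big],
\]
with optimizer $\pi^*(a) \propto \pi_0(a)\, e^{\beta r(a)}$. In this sketch, the worst-case reward perturbation takes the form $\Delta r(a) = \tfrac{1}{\beta} \log \tfrac{\pi^*(a)}{\pi_0(a)}$, so the perturbed reward $r(a) - \Delta r(a)$ is constant across actions: the adversary discounts exactly those actions the policy prefers over the reference, which is the hedging behavior described above.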