Entropic regularization of policies in Reinforcement Learning (RL) is a commonly used heuristic to ensure that the learned policy explores the state-space sufficiently before prematurely converging to a locally optimal policy. The primary motivation for using entropy is exploration and the disambiguation of optimal policies; however, its theoretical effects are not entirely understood. In this work, we study the more general regularized RL objective and, using Fenchel duality, derive the dual problem, which takes the form of an adversarial reward problem. In particular, we find that the optimal policy found by a regularized objective is precisely an optimal policy of a reinforcement learning problem under a worst-case adversarial reward. Our result allows us to reinterpret the popular entropic regularization scheme as a form of robustification. Furthermore, due to the generality of our results, they also apply to other existing regularization schemes. Our results thus give insights into the effects of policy regularization and deepen our understanding of exploration through robust rewards at large.
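As a rough illustration of the duality at play (using our own notation, which need not match the paper's exact formulation): let $d^\pi$ denote the discounted state-action occupancy measure of policy $\pi$, $r$ the reward, $\Omega$ a closed convex regularizer on occupancies (e.g., negative entropy), and $\Omega^*$ its convex conjugate. Since $\Omega = \Omega^{**}$, Fenchel duality gives
\[
\max_{\pi}\;\Big[\langle d^{\pi}, r\rangle - \Omega(d^{\pi})\Big]
\;=\;
\max_{\pi}\,\min_{r'}\;\Big[\langle d^{\pi},\, r - r'\rangle + \Omega^{*}(r')\Big],
\]
so the regularized optimal policy can be read as optimal against a worst-case perturbed reward $r - r'$, where the adversary's perturbation $r'$ is penalized by $\Omega^{*}$. This sketch assumes $\Omega$ is closed and convex; the body of the paper makes the precise conditions and reward set explicit.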