The framework of deep reinforcement learning (DRL) provides a powerful and widely applicable mathematical formalization for sequential decision-making. In this paper, we start by studying the f-divergence between the learning policy and the sampling policy and derive a novel DRL framework, termed f-Divergence Reinforcement Learning (FRL). We highlight that the policy evaluation and policy improvement phases are induced by minimizing the f-divergence between the learning policy and the sampling policy, which is distinct from the conventional DRL objective of maximizing the expected cumulative reward. Moreover, through the Fenchel conjugate, we convert this framework into a saddle-point optimization problem for a specific choice of the f function, consisting of policy evaluation and policy improvement, and from it we derive new policy evaluation and policy improvement methods in FRL. Our framework may give new insights for analyzing DRL algorithms. The FRL framework achieves two advantages: (1) the policy evaluation and policy improvement processes are derived simultaneously from the f-divergence; (2) the overestimation of the value function is alleviated. To evaluate the effectiveness of the FRL framework, we conduct experiments on Atari 2600 video games; the results show that our framework matches or surpasses the DRL algorithms we tested.
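As a brief illustration of the conversion (a sketch in generic notation, not necessarily the paper's exact symbols or objective), the saddle-point form follows from the standard variational representation of an f-divergence via its Fenchel conjugate \(f^{*}\), with \(\pi\) the learning policy, \(\mu\) the sampling policy, and \(T\) a dual variable:
\[
D_f(\pi \,\|\, \mu) \;=\; \sup_{T}\; \mathbb{E}_{x\sim\pi}\!\left[T(x)\right] \;-\; \mathbb{E}_{x\sim\mu}\!\left[f^{*}\!\left(T(x)\right)\right],
\]
so that minimizing the divergence over \(\pi\) while maximizing over \(T\) yields a min-max problem \(\min_{\pi}\max_{T}\,\mathbb{E}_{\pi}[T]-\mathbb{E}_{\mu}[f^{*}(T)]\), whose inner and outer optimizations play roles analogous to policy evaluation and policy improvement.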