The framework of deep reinforcement learning (DRL) provides a powerful and widely applicable mathematical formalization for sequential decision-making. This paper presents a novel DRL framework, termed \emph{$f$-Divergence Reinforcement Learning (FRL)}. In FRL, the policy evaluation and policy improvement phases are performed simultaneously by minimizing the $f$-divergence between the learning policy and the sampling policy, which is distinct from conventional DRL algorithms that aim to maximize the expected cumulative reward. We theoretically prove that minimizing such an $f$-divergence makes the learning policy converge to the optimal policy. Moreover, by means of the Fenchel conjugate, we convert the process of training agents in the FRL framework into a saddle-point optimization problem for a specific choice of the function $f$, which yields new methods for policy evaluation and policy improvement. Through mathematical proofs and empirical evaluation, we demonstrate that the FRL framework has two advantages: (1) the policy evaluation and policy improvement processes are performed simultaneously, and (2) the issue of overestimating the value function is naturally alleviated. To evaluate the effectiveness of the FRL framework, we conduct experiments on Atari 2600 video games and show that agents trained in the FRL framework match or surpass the baseline DRL algorithms.
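For context, a minimal sketch of the standard $f$-divergence and its Fenchel-conjugate (variational) representation, on which saddle-point formulations of this kind are typically built, is given below; the notation here assumes the usual definitions for a convex function $f$ with $f(1)=0$, and the exact objective used in FRL may differ.
\[
D_f\!\left(\pi \,\middle\|\, \mu\right)
= \mathbb{E}_{a \sim \mu}\!\left[ f\!\left(\frac{\pi(a)}{\mu(a)}\right) \right]
= \sup_{T} \; \mathbb{E}_{a \sim \pi}\!\left[ T(a) \right] - \mathbb{E}_{a \sim \mu}\!\left[ f^{*}\!\left(T(a)\right) \right],
\]
where $f^{*}(t) = \sup_{u} \{\, t u - f(u) \,\}$ denotes the Fenchel conjugate of $f$ and the supremum is taken over measurable functions $T$. Replacing the supremum with a parameterized critic turns the divergence-minimization objective into a min--max (saddle-point) problem of the type described above.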