Traditional off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy. However, when value functions are updated to track the learned policy, they forget potentially useful information about old policies. We introduce a class of value functions called Parameter-based Value Functions (PVFs) whose inputs include the policy parameters. They can generalize across different policies. PVFs can evaluate the performance of any policy given a state, a state-action pair, or a distribution over the RL agent's initial states. First, we show how PVFs yield novel off-policy policy gradient theorems. Then, we derive off-policy actor-critic algorithms based on PVFs trained by Monte Carlo or Temporal Difference methods. We show how learned PVFs can zero-shot learn new policies that outperform any policy seen during training. Finally, our algorithms are evaluated on a selection of discrete and continuous control tasks using shallow policies and deep neural networks. Their performance is comparable to that of state-of-the-art methods.
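To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of a parameter-based state-value function V(s, θ) trained by Monte Carlo regression, assuming PyTorch. Names such as `PVF` and `flatten_params`, the network sizes, and the dummy data are illustrative assumptions; the point is only that the critic conditions on the flattened policy parameters in addition to the state, so it can generalize across policies.

```python
# Sketch of a Parameter-based Value Function (PVF): a critic V(s, theta)
# whose inputs are a state and the flattened parameters of the policy
# being evaluated. Illustrative only; names and sizes are assumptions.
import torch
import torch.nn as nn


def flatten_params(policy: nn.Module) -> torch.Tensor:
    """Concatenate all policy parameters into a single vector theta."""
    return torch.cat([p.detach().reshape(-1) for p in policy.parameters()])


class PVF(nn.Module):
    """V(s, theta): estimated return from state s under the policy with parameters theta."""

    def __init__(self, state_dim: int, policy_param_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + policy_param_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, policy_params: torch.Tensor) -> torch.Tensor:
        # Condition the value estimate on the policy parameters themselves.
        return self.net(torch.cat([state, policy_params], dim=-1))


if __name__ == "__main__":
    state_dim, action_dim = 4, 2
    policy = nn.Linear(state_dim, action_dim)      # a shallow policy
    theta = flatten_params(policy)                 # policy-parameter input
    pvf = PVF(state_dim, theta.numel())

    # Monte Carlo regression: fit V(s, theta) to observed returns collected
    # under (possibly old) policies, instead of tracking a single target policy.
    states = torch.randn(32, state_dim)            # dummy batch of states
    returns = torch.randn(32, 1)                   # dummy Monte Carlo returns
    thetas = theta.expand(32, -1)                  # same policy repeated over the batch

    optim = torch.optim.Adam(pvf.parameters(), lr=1e-3)
    loss = nn.functional.mse_loss(pvf(states, thetas), returns)
    optim.zero_grad()
    loss.backward()
    optim.step()
```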