Traditional off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy. However, when value functions are updated to track the learned policy, they forget potentially useful information about old policies. We introduce a class of value functions called Parameter-Based Value Functions (PBVFs) whose inputs include the policy parameters. They can generalize across different policies. PBVFs can evaluate the performance of any policy given a state, a state-action pair, or a distribution over the RL agent's initial states. First, we show how PBVFs yield novel off-policy policy gradient theorems. Then we derive off-policy actor-critic algorithms based on PBVFs trained by Monte Carlo or Temporal Difference methods. We show how learned PBVFs can zero-shot learn new policies that outperform any policy seen during training. Finally, our algorithms are evaluated on a selection of discrete and continuous control tasks using shallow policies and deep neural networks. Their performance is comparable to that of state-of-the-art methods.
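To make the core idea concrete, the following is a minimal sketch (not the authors' code) of a parameter-based state-value function: a critic that receives the state together with a flattened vector of policy parameters, so a single value function can evaluate many different policies. The PyTorch framing, network sizes, and names such as PBVF and flatten_params are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class PBVF(nn.Module):
    """Parameter-based state-value function V(s, theta): one critic for many policies."""

    def __init__(self, state_dim: int, policy_param_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + policy_param_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, policy_params: torch.Tensor) -> torch.Tensor:
        # Condition the value estimate on the policy parameters by concatenation.
        return self.net(torch.cat([state, policy_params], dim=-1))


def flatten_params(policy: nn.Module) -> torch.Tensor:
    # Flatten a policy network's weights into a single parameter vector theta.
    return torch.cat([p.detach().reshape(-1) for p in policy.parameters()])


# Usage sketch: evaluate an arbitrary policy at a given state.
state_dim, action_dim = 8, 2
policy = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, action_dim))
theta = flatten_params(policy)
critic = PBVF(state_dim, theta.numel())
value = critic(torch.zeros(1, state_dim), theta.unsqueeze(0))  # estimate of V(s, theta)
```

Because the critic is differentiable with respect to the policy-parameter input, its gradient with respect to theta can be used to improve the policy, which is what enables reusing data gathered under old policies and zero-shot proposing new ones.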