This paper investigates using prior computation to estimate the value function in order to improve sample efficiency in on-policy policy gradient methods in reinforcement learning. Our approach is to estimate the value function from prior computations, such as the Q-network learned by DQN or value functions trained on different but related environments. In particular, we learn a new value function for the target task while combining it with a value estimate obtained from the prior computation. The resulting value function is then used as a baseline in the policy gradient method. Using such a baseline has the theoretical property of reducing variance in the gradient estimate and thus improving sample efficiency. Experiments demonstrate the successful use of prior value estimates in various settings and show improved sample efficiency on several tasks.
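A minimal sketch of the idea described above, assuming PyTorch and hypothetical names (`ResidualValueBaseline`, `prior_value_fn`, `policy_gradient_loss`): a frozen value estimate from prior computation is combined with a small learned correction, and the combined value is subtracted from the returns as a baseline in a REINFORCE-style loss. This is an illustrative assumption about the combination scheme, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ResidualValueBaseline(nn.Module):
    """Combine a fixed prior value estimate with a learned correction for the target task."""
    def __init__(self, prior_value_fn, obs_dim, hidden=64):
        super().__init__()
        # prior_value_fn: e.g. s -> max_a Q(s, a) from a pretrained DQN,
        # or a value network trained on a related environment (assumed given).
        self.prior_value_fn = prior_value_fn
        self.correction = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, obs):
        with torch.no_grad():
            prior = self.prior_value_fn(obs)              # prior estimate, kept fixed
        return prior + self.correction(obs).squeeze(-1)   # learned residual on top

def policy_gradient_loss(log_probs, returns, baseline_values):
    """REINFORCE-style loss with a baseline: subtracting b(s) reduces variance
    without biasing the gradient, since E[grad log pi(a|s)] = 0."""
    advantages = (returns - baseline_values).detach()
    return -(log_probs * advantages).mean()
```

In this sketch only the correction network and the policy are updated on the target task; the prior value estimate acts as a variance-reducing starting point for the baseline.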