Despite the increasing popularity of policy gradient methods, they have yet to be widely adopted in sample-scarce applications such as robotics. Sample efficiency could be improved by making the best use of the available information. As a key component in reinforcement learning, the reward function is usually devised carefully to guide the agent. Hence, the reward function is usually known, giving access not only to scalar reward signals but also to reward gradients. To benefit from reward gradients, previous works require knowledge of the environment dynamics, which is hard to obtain. In this work, we develop the \textit{Reward Policy Gradient} estimator, a novel approach that integrates reward gradients without learning a model. Bypassing the model dynamics allows our estimator to achieve a better bias-variance trade-off, which results in higher sample efficiency, as shown in the empirical analysis. Our method also boosts the performance of Proximal Policy Optimization on different MuJoCo control tasks.
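For concreteness, the contrast can be sketched as follows; this is an illustrative formulation under standard assumptions (a likelihood-ratio gradient and a reparameterized action $a = f_\theta(s, \epsilon)$ with noise $\epsilon$), not the exact estimator derived in this work. The classical policy gradient uses only scalar returns, whereas a known, differentiable reward function lets its gradient $\nabla_a r(s,a)$ enter the update directly:
\begin{align}
\nabla_\theta J(\theta) &= \mathbb{E}_{s,\,a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right],\\
\nabla_\theta\, \mathbb{E}_{\epsilon}\!\left[r\bigl(s, f_\theta(s, \epsilon)\bigr)\right] &= \mathbb{E}_{\epsilon}\!\left[\nabla_\theta f_\theta(s, \epsilon)^{\top}\, \nabla_a r(s, a)\big|_{a = f_\theta(s, \epsilon)}\right].
\end{align}
The second expression requires no model of the environment dynamics, since only the reward's dependence on the action is differentiated.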