Training a game-playing reinforcement learning agent requires many interactions with the environment. Uninformed random exploration wastes time and resources, so reducing this waste is essential. In this paper, under the setting of off-policy actor-critic algorithms, we demonstrate that the critic yields an expected discounted return greater than or equal to that of the actor. The Q value predicted by the critic is therefore a better signal for redistributing actions originally sampled from the policy distribution predicted by the actor. This paper introduces the novel Critic Guided Action Redistribution (CGAR) algorithm and evaluates it on the OpenAI MuJoCo tasks. The experimental results demonstrate that our method improves sample efficiency and achieves state-of-the-art performance. Our code can be found at https://github.com/tairanhuang/CGAR.
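To make the idea of critic-guided action redistribution concrete, the sketch below shows one plausible reading of the abstract: sample several candidate actions from the actor's policy distribution, score them with the critic's Q values, and resample one candidate with probability proportional to a softmax over those scores. This is a minimal illustration under stated assumptions, not the paper's exact implementation; the names `PolicyNet`, `QNet`, `select_action`, `num_candidates`, and `temperature` are hypothetical.

```python
# Hypothetical sketch of critic-guided action redistribution (not the
# paper's reference implementation; names and hyperparameters are assumed).
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Actor: maps a state to a Gaussian action distribution."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mu(h), self.log_std(h).exp())

class QNet(nn.Module):
    """Critic: maps a (state, action) pair to a scalar Q value."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def select_action(policy, critic, state, num_candidates=8, temperature=1.0):
    """Sample candidate actions from the actor, then redistribute the choice
    among them using a softmax over the critic's Q values."""
    dist = policy(state)                                    # state: (1, state_dim)
    candidates = dist.sample((num_candidates,)).squeeze(1)  # (num_candidates, action_dim)
    states = state.expand(num_candidates, -1)               # repeat state per candidate
    q_values = critic(states, candidates).squeeze(-1)       # (num_candidates,)
    probs = torch.softmax(q_values / temperature, dim=0)    # critic-guided redistribution
    idx = torch.multinomial(probs, 1).item()
    return candidates[idx]
```

In this reading, the actor still proposes the candidate actions, so exploration remains anchored to the policy distribution, while the critic's Q estimates bias the final choice toward actions with higher expected discounted return.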