We propose a novel algorithm named Expert Q-learning. Expert Q-learning is inspired by Dueling Q-learning and aims to incorporate ideas from semi-supervised learning into reinforcement learning by splitting Q-values into state values and action advantages. Unlike Generative Adversarial Imitation Learning and Deep Q-Learning from Demonstrations, the offline expert we use only predicts the value of a state from {-1, 0, 1}, indicating whether the state is bad, neutral, or good. In addition to the Q-network, we design an expert network, which is updated after each regular offline minibatch update whenever the expert example buffer is not empty. During the update, the Q-network plays the role of the advantage function only. Our algorithm also keeps asynchronous copies of the Q-network and the expert network, predicting target values in the same manner as Double Q-learning. We compared our algorithm on the game of Othello with the state-of-the-art Q-learning algorithm, which combines Double Q-learning and Dueling Q-learning. The results showed that Expert Q-learning is indeed useful and more resistant to the overestimation bias of Q-learning. The baseline Q-learning algorithm exhibited unstable and suboptimal behavior, especially when playing against a stochastic player, whereas Expert Q-learning demonstrated more robust performance and higher scores. Expert Q-learning without expert examples also achieved better results than the baseline algorithm when trained and tested against a fixed player. On the other hand, Expert Q-learning without examples could not win against the baseline Q-learning algorithm in direct game competitions, even though it also reduced the overestimation bias.
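For concreteness, the following is a minimal sketch of how such a target might be formed when the Q-network is treated as the advantage function and the expert network supplies a scalar state value, with action selection and evaluation split in the Double Q-learning manner. It is an illustration under these assumptions, not the authors' implementation; the network and batch names (q_net, q_target, expert_net, expert_target) are hypothetical, and the expert network is assumed to be trained separately on the expert example buffer.

```python
import torch
import torch.nn.functional as F

def expert_q_loss(batch, q_net, q_target, expert_net, expert_target, gamma=0.99):
    """Minimal sketch (not the authors' code): a TD loss in which the Q-network
    is treated as the advantage function and the expert network contributes a
    scalar state value (trained separately on expert examples toward {-1, 0, 1}).
    `batch` is assumed to contain tensors: states, actions, rewards,
    next_states, dones."""
    states, actions, rewards, next_states, dones = batch

    with torch.no_grad():
        # Double Q-learning style: select the next action with the online
        # Q-network, but evaluate it with the asynchronous target copies.
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        next_adv = q_target(next_states).gather(1, next_actions).squeeze(1)
        next_val = expert_target(next_states).squeeze(1)  # expert's state value
        targets = rewards + gamma * (1.0 - dones) * (next_val + next_adv)

    # Online estimate: expert state value (detached) plus Q-network "advantage";
    # only the Q-network receives gradients from this TD loss.
    adv = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    value = expert_net(states).squeeze(1).detach()
    return F.mse_loss(value + adv, targets)
```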