Q-learning suffers from overestimation bias, because it approximates the maximum action value using the maximum estimated action value. Algorithms have been proposed to reduce overestimation bias, but we lack an understanding of how bias interacts with performance, and the extent to which existing algorithms mitigate bias. In this paper, we 1) highlight that the effect of overestimation bias on learning efficiency is environment-dependent; 2) propose a generalization of Q-learning, called \emph{Maxmin Q-learning}, which provides a parameter to flexibly control bias; 3) show theoretically that there exists a parameter choice for Maxmin Q-learning that leads to unbiased estimation with a lower approximation variance than Q-learning; and 4) prove the convergence of our algorithm in the tabular case, as well as convergence of several previous Q-learning variants, using a novel Generalized Q-learning framework. We empirically verify that our algorithm better controls estimation bias in toy environments, and that it achieves superior performance on several benchmark problems.
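To make the idea concrete, here is a minimal tabular sketch of the Maxmin Q-learning update described above: N independent action-value estimates are maintained, and the bootstrap target takes the minimum over the N estimates before maximizing over next actions. The function name, hyperparameter defaults, and the choice to update a single randomly selected estimate are illustrative assumptions, not the paper's exact pseudocode.

```python
import numpy as np

def maxmin_q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99, rng=None):
    """One tabular Maxmin Q-learning update (illustrative sketch).

    Q: array of shape (N, n_states, n_actions) holding N action-value estimates.
    The parameter N controls the estimation bias: N=1 recovers Q-learning,
    larger N pushes the estimate toward underestimation.
    """
    rng = rng or np.random.default_rng()
    N = Q.shape[0]
    # Minimum over the N estimates at the next state, then maximum over actions.
    q_min_next = Q[:, s_next, :].min(axis=0)            # shape: (n_actions,)
    target = r + (0.0 if done else gamma * q_min_next.max())
    # Update one randomly chosen estimate toward the shared target.
    i = rng.integers(N)
    Q[i, s, a] += alpha * (target - Q[i, s, a])
    return Q
```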