The optimistic nature of the Q-learning target leads to an overestimation bias, an inherent problem of standard $Q$-learning. Such a bias fails to account for the possibility of low returns, particularly in risky scenarios. However, bias, whether toward overestimation or underestimation, is not necessarily undesirable. In this paper, we analytically examine the utility of biased learning, and show that specific types of bias may be preferable, depending on the scenario. Based on this finding, we design a novel reinforcement learning algorithm, Balanced Q-learning, in which the target is modified to be a convex combination of a pessimistic and an optimistic term, whose associated weights are determined online, analytically. We prove the convergence of this algorithm in a tabular setting, and empirically demonstrate its superior learning performance in various environments.
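As an illustrative sketch of such a balanced target, assuming the optimistic term is the standard maximizing bootstrap and the pessimistic term is a minimizing counterpart (the abstract does not fix these choices, nor the analytic weight rule), the target could take the form
\[
y_t \;=\; r_t + \gamma\Bigl[\,\beta_t \max_{a'} Q(s_{t+1}, a') \;+\; (1-\beta_t)\min_{a'} Q(s_{t+1}, a')\,\Bigr], \qquad \beta_t \in [0,1],
\]
where $\beta_t$ is the convex-combination weight determined online.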