The classic DQN algorithm is limited by the overestimation bias of the learned Q-function. Subsequent algorithms have proposed techniques to reduce this bias without fully eliminating it. Recently, the Maxmin and Ensemble Q-learning algorithms have used the distinct estimates produced by an ensemble of learners to reduce the overestimation bias. Unfortunately, these learners can converge to the same point in parameter or representation space, collapsing back to the classic single-network DQN. In this paper, we describe a regularization technique to maximize ensemble diversity in these algorithms. We propose and compare five regularization functions inspired by economic theory and consensus optimization. We show that the regularized approach significantly outperforms the Maxmin and Ensemble Q-learning algorithms as well as non-ensemble baselines.
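The diversity-regularization idea can be illustrated with a minimal sketch. This is not the paper's exact formulation: the penalty below (negative mean pairwise squared distance between members' flattened parameter vectors) and the weight `lam` are illustrative assumptions, standing in for the five regularizers the paper compares.

```python
import numpy as np

def diversity_penalty(params):
    """Negative mean pairwise squared distance between the flattened
    parameter vectors of the ensemble members. Because the penalty is
    negative, minimizing the total loss pushes members apart and
    discourages collapse to a single point in parameter space."""
    n = len(params)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.sum((params[i] - params[j]) ** 2)
            count += 1
    return -total / count

def regularized_loss(td_losses, params, lam=0.1):
    # Standard ensemble TD loss plus the (assumed) diversity term.
    return float(np.mean(td_losses) + lam * diversity_penalty(params))
```

With identical members the penalty vanishes and the loss reduces to the plain TD loss; spreading the members out strictly lowers the regularized objective, which is the collapse-avoidance behavior the abstract describes.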