Reinforcement learning algorithms based on Q-Learning are driving Deep Reinforcement Learning (DRL) research towards solving complex problems and achieving super-human performance on many of them. Nevertheless, Q-Learning is known to be positively biased, since it learns by using the maximum over noisy estimates of expected values. Systematic overestimation of the action values, coupled with the inherently high variance of DRL methods, can cause errors to accumulate incrementally and lead learning algorithms to diverge. Ideally, we would like DRL agents to take into account their own uncertainty about the optimality of each action and be able to exploit it to make more informed estimates of the expected return. In this regard, Weighted Q-Learning (WQL) effectively reduces bias and shows remarkable results in stochastic environments. WQL uses a weighted sum of the estimated action values, where the weights correspond to the probability of each action value being the maximum; however, the computation of these probabilities is only practical in the tabular setting. In this work, we provide methodological advances to benefit from the WQL properties in DRL, by using neural networks trained with Dropout as an effective approximation of deep Gaussian processes. In particular, we adopt the Concrete Dropout variant to obtain calibrated estimates of epistemic uncertainty in DRL. The estimator is then obtained by taking several stochastic forward passes through the action-value network and computing the weights in a Monte Carlo fashion. Such weights are Bayesian estimates of the probability of each action value corresponding to the maximum w.r.t. a posterior probability distribution estimated by Dropout. We show how our novel Deep Weighted Q-Learning algorithm reduces the bias w.r.t. relevant baselines and provide empirical evidence of its advantages on representative benchmarks.
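As a rough illustration of the estimator described above, the sketch below shows how the Monte Carlo weights could be computed: dropout is kept active at inference time, several stochastic forward passes are taken through the action-value network, each action's weight is the fraction of passes in which it attains the maximum, and the estimate of the maximum expected value is the weight-averaged mean Q-value. The network architecture, the use of plain nn.Dropout in place of the Concrete Dropout layers used in the paper, the number of samples, and the helper names (DropoutQNet, weighted_q_estimate) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class DropoutQNet(nn.Module):
    """Small action-value network with dropout layers (a stand-in for
    the Concrete Dropout layers adopted in the paper)."""

    def __init__(self, state_dim, n_actions, hidden=128, p=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)


@torch.no_grad()
def weighted_q_estimate(qnet, state, n_samples=30):
    """Monte Carlo WQL-style estimate of max_a Q(s, a) for a single state.

    Draws `n_samples` stochastic forward passes (dropout active), counts
    how often each action attains the maximum to obtain the weights w(a),
    and returns sum_a w(a) * mean_Q(s, a).
    """
    qnet.train()  # keep dropout stochastic during the forward passes
    q_samples = torch.stack([qnet(state) for _ in range(n_samples)])  # (K, n_actions)
    best = q_samples.argmax(dim=-1)  # index of the maximizing action in each sample
    n_actions = q_samples.shape[-1]
    weights = torch.bincount(best, minlength=n_actions).float() / n_samples
    return (weights * q_samples.mean(dim=0)).sum()
```

For example, with qnet = DropoutQNet(state_dim=4, n_actions=2) and state = torch.randn(4), weighted_q_estimate(qnet, state) returns a scalar that replaces the hard max of standard Q-Learning targets, which is where the reduction of the positive bias comes from.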