Value-based reinforcement-learning algorithms have shown strong performance in games, robotics, and other real-world applications. The most popular sample-based method is $Q$-Learning, which iteratively updates the current $Q$-estimate towards the observed reward plus the maximum of the $Q$-estimates of the next state. This procedure introduces a maximization bias, which approaches such as Double $Q$-Learning attempt to reduce. We frame the bias problem statistically and view it as an instance of estimating the maximum expected value (MEV) of a set of random variables. We propose the $T$-Estimator (TE), based on two-sample testing for the mean, which flexibly interpolates between over- and underestimation by adjusting the significance level of the underlying hypothesis tests. A generalization, termed the $K$-Estimator (KE), obeys the same bias and variance bounds as the TE while relying on a nearly arbitrary kernel function. We introduce modifications of $Q$-Learning and the Bootstrapped Deep $Q$-Network (BDQN) that use the TE and the KE, and we further propose an adaptive variant of the TE-based BDQN that dynamically adjusts the significance level to minimize the absolute estimation bias. All proposed estimators and algorithms are thoroughly tested and validated on diverse tasks and environments, illustrating the bias-control and performance potential of the TE and KE.
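As a brief illustration of the bias referred to above (a sketch in standard tabular notation, with $s_t$, $a_t$, $r_t$, step size $\alpha$, and discount factor $\gamma$ assumed rather than taken from the paper body), the sample-based $Q$-Learning update is
$$
Q(s_t, a_t) \;\leftarrow\; Q(s_t, a_t) + \alpha \Bigl[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \Bigr],
$$
where the target treats $\max_a Q(s_{t+1}, a)$ as an estimate of the MEV $\max_a \mathbb{E}\bigl[Q(s_{t+1}, a)\bigr]$. For unbiased mean estimates $\hat\mu_1, \dots, \hat\mu_M$ of $\mu_1, \dots, \mu_M$, Jensen's inequality gives
$$
\mathbb{E}\Bigl[\max_i \hat\mu_i\Bigr] \;\ge\; \max_i \mathbb{E}[\hat\mu_i] \;=\; \max_i \mu_i,
$$
so the "max of estimates" overestimates the MEV, whereas the double estimator used by Double $Q$-Learning is known to underestimate it; the TE and KE are designed to interpolate between these two regimes.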