Value-based reinforcement-learning algorithms have shown strong performance in games, robotics, and other real-world applications. The most popular sample-based method is $Q$-Learning. A $Q$-value is the expected return for a state-action pair when following a particular policy, and the algorithm performs updates by adjusting the current $Q$-value toward the sum of the observed reward and the maximum of the $Q$-values of the next state. This procedure introduces maximization bias, and solutions like Double $Q$-Learning have been considered. We frame the bias problem statistically and consider it an instance of estimating the maximum expected value (MEV) of a set of random variables. We propose the $T$-Estimator (TE), based on two-sample testing for the mean. The TE flexibly interpolates between over- and underestimation by adjusting the significance level of the underlying hypothesis tests. A generalization termed the $K$-Estimator (KE) obeys the same bias and variance bounds as the TE while relying on a nearly arbitrary kernel function. Using the TE and the KE, we introduce modifications of $Q$-Learning and its neural network analog, the Deep $Q$-Network. The proposed estimators and algorithms are thoroughly tested and validated on a diverse set of tasks and environments, illustrating the performance potential of the TE and KE.
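As a brief illustration of the MEV estimation problem referred to above, the Python sketch below (not taken from the paper; all names and parameters are invented for illustration) shows how taking the maximum of noisy sample means overestimates the true maximum expected value when every action has the same mean, which is the statistical root of $Q$-Learning's maximization bias.

```python
import numpy as np

# Illustrative sketch of the maximization bias behind the MEV problem.
# All actions have true mean 0, so the true MEV is 0, yet the maximum of
# noisy sample means is positively biased -- the same bias that the max
# operator introduces into the Q-Learning target.
rng = np.random.default_rng(0)

n_actions, n_samples, n_trials = 10, 20, 10_000
estimates = []
for _ in range(n_trials):
    # Rewards ~ N(0, 1) for every action; sample means are noisy estimates.
    sample_means = rng.normal(0.0, 1.0, size=(n_actions, n_samples)).mean(axis=1)
    estimates.append(sample_means.max())  # max over noisy means

print(f"true MEV: 0.0, mean estimate: {np.mean(estimates):.3f}")  # clearly > 0
```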