Temporal-Difference (TD) learning methods, such as Q-Learning, have proven effective at learning policies for control tasks. One issue with methods like Q-Learning is that the value update introduces bias when predicting the TD target of an unfamiliar state. Estimation noise becomes bias after the max operator in the policy improvement step and carries over to the value estimates of other states, causing Q-Learning to overestimate Q-values. Algorithms like Soft Q-Learning (SQL) introduce the notion of a soft-greedy policy, which reduces this estimation bias via soft updates in the early stages of training. However, the inverse temperature $\beta$ that controls the softness of an update is usually set by a hand-designed heuristic, which can fail to capture the uncertainty in the target estimate. Under the belief that $\beta$ is closely related to the (state-dependent) model uncertainty, Entropy Regularized Q-Learning (EQL) further introduces a principled schedule for $\beta$ by maintaining a collection of model parameters that characterizes the model uncertainty. In this paper, we present Unbiased Soft Q-Learning (UQL), which extends EQL from two-action, finite-state-space Markov Decision Processes to multi-action, infinite-state-space ones. We also provide a principled numerical schedule for $\beta$ during the optimization process, extended from SQL and informed by model uncertainty. We show theoretical guarantees and demonstrate the effectiveness of this update method in experiments on several discrete control environments.
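To make the role of $\beta$ concrete, the following is a minimal sketch of the kind of soft TD target used in SQL-style methods; the exact operator used in UQL may differ, and the symbols $y$, $r$, $\gamma$, $s'$, $a'$, and $\mathcal{A}$ are standard notation assumed here rather than taken from the abstract. The hard max over next-state Q-values is replaced by a log-sum-exp with inverse temperature $\beta$:

$$
y \;=\; r \;+\; \gamma \,\frac{1}{\beta}\,\log \sum_{a' \in \mathcal{A}} \exp\!\big(\beta\, Q(s', a')\big).
$$

As $\beta \to \infty$ this recovers the greedy target $r + \gamma \max_{a'} Q(s', a')$, while for small $\beta$ it approaches (up to the constant offset $(\log|\mathcal{A}|)/\beta$) a uniform average over actions. Scheduling $\beta$ from small to large thus trades bias reduction in the noisy early stages of training for eventual greedy behavior.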