We consider the exploration-exploitation trade-off in reinforcement learning and show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space; it is similar to other well-known methods in the literature, including Q-learning, soft Q-learning, and maximum entropy policy gradient, and is closely related to optimism-based and count-based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action pair and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
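To make the construction concrete, the display below sketches, schematically, the kind of Bellman equation and Boltzmann policy referred to above. Here $\hat r_l(s,a)$ and $\hat P_l(s' \mid s,a)$ denote empirical reward and transition estimates, $b_l(s,a)$ stands in for the bonus added to the reward, and $\tau > 0$ is the risk-seeking (temperature) parameter; the exact form of the bonus and the annealing schedule for $\tau$ are determined by the analysis and are left unspecified here.
\[
K_l(s,a) \;=\; \hat r_l(s,a) + b_l(s,a) + \sum_{s'} \hat P_l(s' \mid s,a)\, \tau \log \sum_{a'} \exp\!\big( K_{l+1}(s',a') / \tau \big),
\qquad
\pi_l(a \mid s) \;\propto\; \exp\!\big( K_l(s,a) / \tau \big),
\]
with $K_{L+1} \equiv 0$. Solving this backward recursion and reading off $\pi_l$ yields Boltzmann exploration whose temperature equals the risk-seeking parameter, as claimed above.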