In reinforcement learning, Q-values summarize the expected future rewards that the agent will attain. However, they cannot capture the epistemic uncertainty about those rewards. In this work we derive a new Bellman operator whose associated fixed point we call the `knowledge values'. These K-values compress both the expected future rewards and the epistemic uncertainty into a single value, so that high uncertainty, high reward, or both can yield a high K-value. The key principle is to endow the agent with a risk-seeking utility function that is carefully tuned to balance exploration and exploitation. When the agent follows a Boltzmann policy over the K-values, it achieves a Bayes regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed timesteps. We show deep connections between this approach and the soft-max and maximum-entropy strands of research in reinforcement learning.
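As a minimal sketch of the action-selection rule described above, the snippet below samples actions from a Boltzmann (softmax) distribution over per-action K-values. The function name `boltzmann_policy`, the `temperature` parameter, and the example values are illustrative assumptions, not quantities specified in the paper; in particular, the paper's K-values already fold in epistemic uncertainty, whereas here they are simply given as inputs.

```python
import numpy as np

def boltzmann_policy(k_values: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Return action probabilities proportional to exp(K(s, a) / temperature).

    `k_values` holds the K-values of each action in the current state;
    `temperature` controls how sharply the policy concentrates on the
    highest K-value. Both names are illustrative, not from the paper.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    logits = (k_values - k_values.max()) / temperature
    probs = np.exp(logits)
    return probs / probs.sum()

# Example: three actions whose (hypothetical) K-values blend expected reward
# and epistemic uncertainty; an uncertain-but-promising action retains
# substantial probability mass, which is what drives exploration.
k = np.array([1.0, 1.4, 0.2])
action = np.random.choice(len(k), p=boltzmann_policy(k, temperature=0.5))
```

Because the K-values grow with either expected reward or epistemic uncertainty, sampling from this softmax naturally trades off exploitation against exploration without a separate bonus term.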