In reinforcement learning, the Q-values summarize the expected future rewards that the agent will attain. However, they cannot capture the epistemic uncertainty about those rewards. In this work we derive a new Bellman operator with an associated fixed point that we call the `knowledge values'. These K-values compress both the expected future rewards and the epistemic uncertainty into a single value, so that high uncertainty, high reward, or both, can yield high K-values. The key principle is to endow the agent with a risk-seeking utility function that is carefully tuned to balance exploration and exploitation. When the agent follows a Boltzmann policy over the K-values, it attains a Bayes regret bound of $\tilde O(L \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the total number of states, $A$ is the number of actions, and $T$ is the number of elapsed timesteps. We show deep connections between this approach and the soft-max and maximum-entropy strands of research in reinforcement learning.
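To make the acting step concrete, the following is a minimal, hypothetical sketch of a Boltzmann policy over K-values for a single state. The abstract does not specify the form of the K-value Bellman operator, so the way the K-values are formed below (a mean reward estimate plus an epistemic-uncertainty bonus) is an illustrative stand-in rather than the paper's actual operator; the array values and the `temperature` parameter are likewise made up for illustration.

```python
import numpy as np

def boltzmann_policy(k_values, temperature=1.0):
    """Sample an action with probability proportional to exp(K / temperature)."""
    logits = k_values / temperature
    logits -= logits.max()            # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return np.random.choice(len(k_values), p=probs)

# Toy per-action estimates for one state (made-up numbers, not from the paper).
q_estimate = np.array([1.0, 0.8, 0.2])    # expected future reward
uncertainty = np.array([0.1, 0.5, 1.5])   # epistemic uncertainty

# Hypothetical combination: either high reward or high uncertainty
# yields a high K-value, so the policy both exploits and explores.
k_values = q_estimate + uncertainty

action = boltzmann_policy(k_values, temperature=0.5)
print("sampled action:", action)
```

The point of the sketch is only that a single scalar per action suffices to drive both exploration (uncertain actions get boosted) and exploitation (rewarding actions get boosted), with the temperature of the Boltzmann policy controlling how sharply the agent commits to the highest K-value.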