We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. In particular, we focus on characterizing the variance over values induced by a distribution over MDPs. Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation, but the over-approximation may result in inefficient exploration. We propose a new uncertainty Bellman equation whose solution converges to the true posterior variance over values and explicitly characterizes the gap in previous work. Moreover, our uncertainty quantification technique is easily integrated into common exploration strategies and scales naturally beyond the tabular setting by using standard deep reinforcement learning architectures. Experiments in difficult exploration tasks, both in tabular and continuous control settings, show that our sharper uncertainty estimates improve sample-efficiency.
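To make the target quantity concrete, the sketch below estimates the variance over values induced by a distribution over MDPs by brute force rather than by the uncertainty Bellman equation proposed here. It is a minimal, illustrative example only: it assumes a small tabular MDP with known rewards, a fixed policy, and a hypothetical Dirichlet posterior over transition dynamics, then samples models from that posterior and evaluates the policy in each sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular problem size and discount factor.
S, A, gamma = 5, 2, 0.9

# Posterior over transitions: one Dirichlet per (s, a), with made-up
# concentration parameters standing in for observed transition counts.
alpha = rng.uniform(0.5, 2.0, size=(S, A, S))

# Known rewards and a fixed stochastic policy pi(a | s).
R = rng.uniform(0.0, 1.0, size=(S, A))
pi = np.full((S, A), 1.0 / A)

def policy_value(P):
    """Solve (I - gamma * P_pi) V = R_pi for one sampled transition model P."""
    R_pi = (pi * R).sum(axis=1)            # expected reward under pi, shape (S,)
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi, shape (S, S)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

# Monte-Carlo estimate of the posterior mean and variance of V^pi:
# sample MDPs from the posterior and evaluate the policy in each one.
values = []
for _ in range(2000):
    P = np.stack([[rng.dirichlet(alpha[s, a]) for a in range(A)] for s in range(S)])
    values.append(policy_value(P))
values = np.asarray(values)

print("posterior mean of V^pi:    ", values.mean(axis=0))
print("posterior variance of V^pi:", values.var(axis=0))
```

The per-state variance printed at the end is the posterior variance over values that the abstract refers to; the proposed uncertainty Bellman equation characterizes this quantity exactly, whereas this sampling baseline only approximates it at the cost of solving many sampled MDPs.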