An inherent problem in reinforcement learning is coping with policies that are uncertain about which action to take (or about the value of a state). Model uncertainty, more formally known as epistemic uncertainty, refers to the expected prediction error of a model beyond the sampling noise. In this paper, we propose a metric for epistemic uncertainty estimation in Q-value functions, which we term pathwise epistemic uncertainty. We further develop a method to compute its approximate upper bound, which we call F-value. We experimentally apply the latter to Deep Q-Networks (DQN) and show that uncertainty estimation in reinforcement learning serves as a useful indication of learning progress. We then propose a new approach to improving sample efficiency in actor-critic algorithms by learning from an existing (previously learned or hard-coded) oracle policy while uncertainty is high, aiming to avoid unproductive random actions during training. We term this approach Critic Confidence Guided Exploration (CCGE). We implement CCGE on Soft Actor-Critic (SAC) using our F-value metric, apply it to a handful of popular Gym environments, and show that it achieves better sample efficiency and total episodic reward than vanilla SAC in limited contexts.
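The guided-exploration idea can be illustrated with a minimal sketch: act on the oracle policy while the critic's epistemic uncertainty is high, and switch to the learned policy once the critic is confident. The sketch below is an illustration only, not the paper's implementation; the names `policy`, `oracle_policy`, `critic_uncertainty` (standing in for an F-value estimate), and `threshold` are assumed placeholders.

```python
import torch

def select_action(state, policy, oracle_policy, critic_uncertainty, threshold):
    """Hypothetical CCGE-style action selection: defer to the oracle while uncertain."""
    with torch.no_grad():
        uncertainty = critic_uncertainty(state)  # e.g., an F-value-like uncertainty estimate
        if uncertainty > threshold:
            return oracle_policy(state)          # lean on the oracle while the critic is uncertain
        return policy(state)                     # trust the learned policy once confidence is high
```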