How to explore efficiently in reinforcement learning is an open problem. Many exploration algorithms employ the epistemic uncertainty of their own value predictions -- for instance, to compute an exploration bonus or upper confidence bound. Unfortunately, the required uncertainty is difficult to estimate in general with function approximation. We propose epistemic value estimation (EVE): a recipe that is compatible with sequential decision making and with neural network function approximators. It equips agents with a tractable posterior over all their parameters from which epistemic value uncertainty can be computed efficiently. We use the recipe to derive an epistemic Q-Learning agent and observe competitive performance on a series of benchmarks. Experiments confirm that the EVE recipe facilitates efficient exploration in hard exploration tasks.
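To make the exploration mechanism referenced in the abstract concrete, the following minimal sketch (not the paper's EVE recipe) shows how a posterior over the parameters of a value function can be turned into an upper confidence bound for action selection. The Gaussian posterior over linear Q-weights, the feature dimensions, and the bonus scale `beta` are all illustrative assumptions.

```python
import numpy as np

# Illustrative assumption: a Gaussian posterior over the weights of a linear
# Q-function (e.g. from Bayesian linear regression on observed returns),
# used to form an exploration bonus of the shape Q_mean + beta * Q_std.
rng = np.random.default_rng(0)

n_features, n_actions = 8, 4
post_mean = rng.normal(size=(n_actions, n_features))          # E[w_a]
post_cov = np.stack([np.eye(n_features) * 0.1
                     for _ in range(n_actions)])               # Cov[w_a]

def ucb_action(phi, beta=1.0):
    """Pick the action maximising mean Q plus an epistemic-uncertainty bonus."""
    q_mean = post_mean @ phi                                    # E[Q(s, a)] per action
    q_var = np.array([phi @ post_cov[a] @ phi
                      for a in range(n_actions)])               # Var[Q(s, a)] per action
    return int(np.argmax(q_mean + beta * np.sqrt(q_var)))       # UCB over actions

state_features = rng.normal(size=n_features)
print("action chosen with exploration bonus:", ucb_action(state_features))
```

With neural network function approximators the same idea applies, except that the posterior is over all network parameters and the value uncertainty must be propagated through the network rather than computed in closed form.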