An effective approach to exploration in reinforcement learning is to rely on an agent's uncertainty over the optimal policy, which can yield near-optimal exploration strategies in tabular settings. However, in non-tabular settings that involve function approximators, obtaining accurate uncertainty estimates is itself a challenging problem. In this paper, we highlight that value estimates are easily biased and temporally inconsistent. In light of this, we propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors. This exploration signal controls for state-action transitions so as to isolate uncertainty in value that is due to uncertainty over the agent's parameters. Because our measure of uncertainty conditions on state-action transitions, we cannot act on this measure directly. Instead, we incorporate it as an intrinsic reward and treat exploration as a separate learning problem, induced by the agent's temporal difference uncertainties. We introduce a distinct exploration policy that learns to collect data with high estimated uncertainty, which gives rise to a curriculum that smoothly changes throughout learning and vanishes in the limit of perfect value estimates. We evaluate our method on hard-exploration tasks, including Deep Sea and Atari 2600 environments, and find that our proposed form of exploration facilitates both diverse and deep exploration.
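To make the core idea concrete, below is a minimal, illustrative sketch: the spread of temporal difference errors across an ensemble of value functions, computed for a given state-action transition, is exposed as an intrinsic reward that a separate exploration policy could be trained to maximise. The ensemble of linear Q-functions and all names (`QEnsemble`, `td_error_uncertainty`) are assumptions made for this example, not the paper's implementation.

```python
# Illustrative sketch only: uncertainty over the value function measured as
# the spread of TD errors across an ensemble, used as an intrinsic reward.
import numpy as np


class QEnsemble:
    """Ensemble of K linear Q-functions over state features (hypothetical)."""

    def __init__(self, feature_dim, num_actions, ensemble_size=10, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix per ensemble member: (K, feature_dim, num_actions).
        self.weights = rng.normal(
            scale=0.1, size=(ensemble_size, feature_dim, num_actions)
        )

    def q_values(self, features):
        # features: (feature_dim,) -> per-member Q-values of shape (K, num_actions).
        return np.einsum("kfa,f->ka", self.weights, features)


def td_error_uncertainty(ensemble, s, a, r, s_next, discount=0.99):
    """Intrinsic reward for a transition (s, a, r, s'): the standard deviation
    of the TD errors computed independently by each ensemble member."""
    q_sa = ensemble.q_values(s)[:, a]                # (K,) value of the taken action
    q_next = ensemble.q_values(s_next).max(axis=1)   # (K,) greedy bootstrap targets
    td_errors = r + discount * q_next - q_sa         # (K,) one TD error per member
    return td_errors.std()


# Usage: an exploration policy would be rewarded with this signal, which
# shrinks as the ensemble's value estimates agree on observed transitions.
ensemble = QEnsemble(feature_dim=4, num_actions=3)
s, s_next = np.ones(4), np.zeros(4)
print(td_error_uncertainty(ensemble, s, a=1, r=0.0, s_next=s_next))
```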