The principle of optimism in the face of uncertainty is prevalent throughout sequential decision-making problems such as multi-armed bandits and reinforcement learning (RL). To be successful, an optimistic RL algorithm must over-estimate the true value function (optimism), but not by so much that it becomes inaccurate (estimation error). In the tabular setting, many state-of-the-art methods produce the required optimism through approaches which are intractable when scaled to deep RL. We re-interpret scalable optimistic model-based algorithms as solving a tractable noise-augmented MDP. This formulation achieves a competitive regret bound of $\tilde{\mathcal{O}}( |\mathcal{S}|H\sqrt{|\mathcal{A}| T } )$ when augmenting with Gaussian noise, where $T$ is the total number of environment steps. We also explore how this trade-off changes in the deep RL setting, where we show empirically that estimation error is significantly more troublesome. However, we also show that if this error is reduced, optimistic model-based RL algorithms can match state-of-the-art performance in continuous control problems.
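As a minimal illustrative instantiation of such a noise-augmented MDP (a sketch, not the paper's exact construction), one may perturb the learned model's mean reward $\hat{r}(s,a)$ before planning with zero-mean Gaussian noise whose scale shrinks with the visit count $n(s,a)$:
$$ \tilde{r}(s,a) = \hat{r}(s,a) + \omega(s,a), \qquad \omega(s,a) \sim \mathcal{N}\!\left(0, \tfrac{\sigma^2}{n(s,a)}\right), $$
and then act greedily with respect to the optimal policy of the perturbed model $(\mathcal{S}, \mathcal{A}, \hat{P}, \tilde{r}, H)$. Here $\hat{r}$, $\hat{P}$, $n(s,a)$, and the noise scale $\sigma$ are placeholder symbols introduced for illustration only; the intuition is that larger perturbations in rarely visited state-action pairs yield the optimism, while their decay with $n(s,a)$ controls the estimation error.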