Model-based reinforcement learning algorithms with probabilistic dynamical models are amongst the most data-efficient learning methods. This is often attributed to their ability to distinguish between epistemic and aleatoric uncertainty. However, while most algorithms distinguish these two uncertainties for learning the model, they ignore this distinction when optimizing the policy, which leads to greedy and insufficient exploration. At the same time, there are no practical solvers for optimistic exploration algorithms. In this paper, we propose a practical optimistic exploration algorithm (H-UCRL). H-UCRL reparameterizes the set of plausible models and hallucinates control directly on the epistemic uncertainty. By augmenting the input space with the hallucinated inputs, H-UCRL can be solved using standard greedy planners. Furthermore, we analyze H-UCRL and construct a general regret bound for well-calibrated models, which is provably sublinear in the case of Gaussian Process models. Based on this theoretical foundation, we show how optimistic exploration can be easily combined with state-of-the-art reinforcement learning algorithms and different probabilistic models. Our experiments demonstrate that optimistic exploration significantly speeds up learning when there are penalties on actions, a setting that is notoriously difficult for existing model-based reinforcement learning algorithms.
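To make the hallucinated-input idea concrete, the following is a minimal Python sketch, not the paper's implementation: it assumes a one-step probabilistic dynamics model given by a mean function and an epistemic standard-deviation function, wraps it with an auxiliary hallucinated control eta in [-1, 1]^d scaled by a confidence parameter beta, and then plans greedily over the augmented input space (real actions plus hallucinated inputs) with simple random shooting. The names hallucinated_dynamics, random_shooting_plan, mean_fn, std_fn, and beta are illustrative assumptions, and a cross-entropy or gradient-based planner could be substituted for the random-shooting loop.

```python
import numpy as np

def hallucinated_dynamics(mean_fn, std_fn, beta=1.0):
    """Wrap a probabilistic dynamics model (mean_fn, std_fn) into a
    deterministic model with an extra 'hallucinated' control eta in [-1, 1]^d
    that steers the epistemic uncertainty directly (optimistic reparameterization)."""
    def step(state, action, eta):
        mu = mean_fn(state, action)      # epistemic mean of the next state
        sigma = std_fn(state, action)    # epistemic standard deviation (per dimension)
        return mu + beta * sigma * np.clip(eta, -1.0, 1.0)
    return step

def random_shooting_plan(step_fn, reward_fn, state, horizon,
                         action_dim, state_dim, n_samples=500, rng=None):
    """Greedy random-shooting planner over the augmented input space:
    real actions a_t and hallucinated inputs eta_t are optimized jointly."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_return, best_action = -np.inf, None
    for _ in range(n_samples):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        etas = rng.uniform(-1.0, 1.0, size=(horizon, state_dim))
        s, total = state, 0.0
        for a, eta in zip(actions, etas):
            total += reward_fn(s, a)
            s = step_fn(s, a, eta)       # optimistic rollout under hallucinated control
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action                   # first action of the best optimistic plan
```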