For over a decade, model-based reinforcement learning has been seen as a way to leverage control-based domain knowledge to improve the sample efficiency of reinforcement learning agents. While model-based agents are conceptually appealing, their policies tend to lag behind those of model-free agents in terms of final reward, especially in non-trivial environments. In response, researchers have proposed model-based agents with increasingly complex components, from ensembles of probabilistic dynamics models to heuristics for mitigating model error. In a reversal of this trend, we show that simple model-based agents, derived from existing ideas, not only match but outperform state-of-the-art model-free agents in both sample efficiency and final reward. We find that combining a model-free soft value estimate for policy evaluation with a model-based stochastic value gradient for policy improvement is effective, achieving state-of-the-art results on a high-dimensional humanoid control task that most model-based agents are unable to solve. Our findings suggest that model-based policy evaluation deserves closer attention.
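The combination described above can be sketched numerically. The toy below is a minimal illustration, not the paper's agent: all components (a linear "learned" dynamics model, a quadratic reward, a stand-in soft value estimate, and a linear policy) are hypothetical, and the stochastic value gradient is approximated by finite differences rather than backpropagation through the model.

```python
# Toy sketch of one-policy-improvement-via-SVG loop.
# Every function here is a hypothetical stand-in, not the paper's agent:
# model_step plays the role of a learned dynamics model, soft_value the
# role of a model-free soft value estimate, and pi(s) = theta * s + eps
# is a reparameterized stochastic policy with a single scalar parameter.

def model_step(s, a):
    # Stand-in "learned" dynamics model (assumed linear for the toy).
    return 0.9 * s + 0.5 * a

def reward(s, a):
    # Quadratic cost on state and action, as a reward.
    return -(s ** 2) - 0.1 * (a ** 2)

def soft_value(s):
    # Stand-in for the model-free soft value estimate at the horizon.
    return -(s ** 2)

def svg_objective(theta, s0, horizon=3, gamma=0.99, noise=(0.1, -0.2, 0.05)):
    # H-step rollout through the model, terminated with the value estimate.
    # Noise samples are fixed (reparameterization) so the objective is a
    # deterministic, differentiable function of theta.
    s, total, disc = s0, 0.0, 1.0
    for eps in noise[:horizon]:
        a = theta * s + eps      # reparameterized action
        total += disc * reward(s, a)
        s = model_step(s, a)
        disc *= gamma
    return total + disc * soft_value(s)

def svg_update(theta, s0, lr=0.05, h=1e-4):
    # One gradient-ascent step on the SVG objective. A real agent would
    # backprop through the model; finite differences keep the toy tiny.
    grad = (svg_objective(theta + h, s0) - svg_objective(theta - h, s0)) / (2 * h)
    return theta + lr * grad

theta = 0.0
for _ in range(200):
    theta = svg_update(theta, s0=1.0)
```

After a few hundred updates the policy parameter moves toward a stabilizing (negative) gain, improving the model-based objective relative to the initial policy. The point of the sketch is only the structure: a short model rollout for policy improvement, closed off by a model-free value estimate.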