By planning through a learned dynamics model, model-based reinforcement learning (MBRL) offers the prospect of good performance with little environment interaction. In practice, however, the learned model is often inaccurate, which impairs planning and leads to poor performance. This paper aims to improve planning with an importance sampling framework that accounts for and corrects the discrepancy between the true and learned dynamics. This framework also motivates an alternative objective for fitting the dynamics model: minimizing the variance of value estimation during planning. We derive and implement this objective, which encourages better prediction on trajectories with larger returns. We observe empirically that our approach improves the performance of current MBRL algorithms on two stochastic control problems, and we provide a theoretical basis for our method.
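To make the importance-sampling idea concrete, the sketch below shows how a return estimated from a trajectory generated by the learned model could be reweighted by the ratio of true to learned transition densities. This is only an illustrative sketch, not the paper's implementation: the callables `p_true` and `p_model` are hypothetical, and in practice the true density is unknown and must be estimated or bounded.

```python
import numpy as np

def is_weighted_return(trajectory, p_true, p_model, gamma=0.99):
    """Importance-weighted return estimate for a model-generated trajectory.

    trajectory: iterable of (s, a, r, s_next) tuples sampled by rolling
        out the learned model.
    p_true, p_model: hypothetical callables giving transition densities
        p(s' | s, a) under the true and learned dynamics, respectively.
    """
    weight, ret = 1.0, 0.0
    for t, (s, a, r, s_next) in enumerate(trajectory):
        # Accumulate the density ratio correcting for the mismatch
        # between the true dynamics and the learned model.
        weight *= p_true(s_next, s, a) / p_model(s_next, s, a)
        ret += (gamma ** t) * r
    return weight * ret
```

Under this view, the variance of such weighted estimates depends on how well the model predicts exactly those transitions that contribute large returns, which motivates the alternative model-fitting objective described above.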