In Model-based Reinforcement Learning (MBRL), model learning is critical since an inaccurate model can bias policy learning by generating misleading samples. However, learning an accurate model is difficult because the policy is continually updated, and the induced distribution over visited states used for model learning shifts accordingly. Prior methods alleviate this issue by quantifying the uncertainty of model-generated samples. However, these methods only quantify the uncertainty passively after the samples have been generated, rather than foreseeing it before model trajectories enter highly uncertain regions. The resulting low-quality samples can induce unstable learning targets and hinder the optimization of the policy. Moreover, while the model is trained to minimize one-step prediction errors, it is generally used for multi-step prediction, leading to a mismatch between the objectives of model learning and model usage. To address these issues, we propose \emph{Plan To Predict} (P2P), an MBRL framework that treats the model rollout process as a sequential decision-making problem by reversing the usual roles: the model acts as the decision maker and the current policy serves as the dynamics. In this way, the model can quickly adapt to the current policy and foresee multi-step future uncertainty when generating trajectories. Theoretically, we show that the performance of P2P can be guaranteed by approximately optimizing a lower bound of the true environment return. Empirical results demonstrate that P2P achieves state-of-the-art performance on several challenging benchmark tasks.
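To make the role reversal concrete, below is a minimal, hypothetical sketch rather than the paper's implementation; the names `model`, `policy`, and `reversed_rollout` are illustrative assumptions. In an ordinary MBRL rollout the policy chooses actions and the model predicts transitions; in the P2P view the model's prediction is treated as the "decision" while the fixed current policy plays the part of the dynamics, which is what allows the model to be trained against multi-step rollout quality rather than one-step error alone.

```python
# Minimal sketch of the reversed-roles rollout described above (illustrative only;
# `model` and `policy` are stand-in callables, not the paper's code).
import numpy as np

def reversed_rollout(model, policy, s0, horizon):
    """Roll out `horizon` steps while viewing the model as the agent.

    model(s, a) -> predicted next state (the model's "decision")
    policy(s)   -> action of the fixed current policy (the "dynamics" side)
    """
    states, actions = [s0], []
    s = s0
    for _ in range(horizon):
        a = policy(s)          # fixed policy: environment side of the reversed MDP
        s_next = model(s, a)   # model's decision: which next state to emit
        # In P2P, the model would receive a multi-step learning signal here,
        # e.g. discouraging rollouts that drift into highly uncertain regions,
        # instead of being fit only to one-step transition targets.
        actions.append(a)
        states.append(s_next)
        s = s_next
    return states, actions

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for the learned model and the current policy.
    policy = lambda s: -0.1 * s + 0.01 * rng.normal(size=s.shape)
    model = lambda s, a: s + a
    s0 = rng.normal(size=3)
    states, _ = reversed_rollout(model, policy, s0, horizon=5)
    print(states[-1])
```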