Distribution models and sample models are two popular model choices in model-based reinforcement learning (MBRL). However, learning these models can be intractable, particularly when the state and action spaces are large. Expectation models, on the other hand, are relatively easy to learn due to their compactness and have been widely used in deterministic environments. For stochastic environments, it is not obvious how expectation models can be used for planning, as they only partially characterize a distribution. In this paper, we propose a sound way to use approximate expectation models for MBRL. In particular, we 1) show that planning with an expectation model is equivalent to planning with a distribution model if the state value function is linear in state features, 2) analyze two common parametrization choices for approximating the expectation: linear and non-linear expectation models, 3) propose a sound model-based policy evaluation algorithm and present its convergence results, and 4) empirically demonstrate the effectiveness of the proposed planning algorithm.
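To make the intuition behind claim 1) concrete, here is a brief sketch using generic notation that is not taken from the paper: let $\hat{v}(s) = \mathbf{w}^\top \boldsymbol{\phi}(s)$ be a value estimate that is linear in state features $\boldsymbol{\phi}(s)$ with weights $\mathbf{w}$, and let $R$ and $S'$ denote the (random) next reward and next state given a state-action pair $(s, a)$. Then, by linearity of expectation,
\[
\mathbb{E}\big[R + \gamma\, \hat{v}(S') \mid s, a\big]
= \mathbb{E}[R \mid s, a] + \gamma\, \mathbf{w}^\top \mathbb{E}\big[\boldsymbol{\phi}(S') \mid s, a\big],
\]
so a one-step backup computed from the full next-state distribution depends on that distribution only through the expected next reward and the expected next feature vector, which is exactly what an expectation model provides. The identity relies on the linearity of $\hat{v}$; with a non-linear value function the expectation cannot, in general, be pulled inside in this way.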