We introduce new planning and reinforcement learning algorithms for discounted MDPs that utilize an approximate model of the environment to accelerate the convergence of the value function. Inspired by the splitting approach in numerical linear algebra, we introduce Operator Splitting Value Iteration (OS-VI) for both Policy Evaluation and Control problems. OS-VI achieves a much faster convergence rate when the model is accurate enough. We also introduce a sample-based version of the algorithm called OS-Dyna. Unlike the traditional Dyna architecture, OS-Dyna still converges to the correct value function in the presence of model approximation error.
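To make the splitting idea concrete, the following is a minimal NumPy sketch of a splitting-based policy-evaluation iteration that uses an approximate transition matrix `P_hat` to solve the Bellman linear system for the true model `P`. The specific update rule, function names, and the toy MDP are illustrative assumptions for exposition, not the paper's exact pseudocode.

```python
# Illustrative sketch (assumptions, not the paper's pseudocode): a matrix-splitting
# iteration for policy evaluation. The Bellman equation (I - gamma*P) V = r is
# solved via the assumed update
#     V_{k+1} = (I - gamma*P_hat)^{-1} (r + gamma*(P - P_hat) V_k),
# which solves the Bellman equation under the approximate model P_hat with a
# reward corrected by the model mismatch. When P_hat is close to P, each
# iteration contracts much faster than standard value iteration.
import numpy as np

def splitting_policy_evaluation(P, P_hat, r, gamma, num_iters=50):
    """Splitting-based iteration for V = r + gamma * P @ V using model P_hat."""
    n = len(r)
    A_hat = np.eye(n) - gamma * P_hat  # operator of the approximate model
    V = np.zeros(n)
    for _ in range(num_iters):
        # Correct the reward by the mismatch between true and approximate models,
        # then solve exactly under the approximate model.
        V = np.linalg.solve(A_hat, r + gamma * (P - P_hat) @ V)
    return V

if __name__ == "__main__":
    # Toy MDP with a fixed policy: random true transitions, slightly perturbed model.
    rng = np.random.default_rng(0)
    n, gamma = 5, 0.95
    P = rng.random((n, n))
    P /= P.sum(axis=1, keepdims=True)           # true transition matrix
    P_hat = 0.9 * P + 0.1 * np.eye(n)           # approximate transition matrix
    r = rng.random(n)
    V_true = np.linalg.solve(np.eye(n) - gamma * P, r)
    V_est = splitting_policy_evaluation(P, P_hat, r, gamma)
    print("max error:", np.max(np.abs(V_est - V_true)))
```

Note that the fixed point of this assumed update is the true value function regardless of the error in `P_hat`; the quality of `P_hat` only affects how quickly the iteration converges.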