This paper addresses the problem of model-free reinforcement learning for Robust Markov Decision Processes (RMDPs) with large state spaces. The goal of the RMDP framework is to find a policy that is robust against parameter uncertainties arising from the mismatch between the simulator model and the real-world setting. We first propose the Robust Least Squares Policy Evaluation algorithm, a multi-step, online, model-free learning algorithm for policy evaluation, and prove its convergence using stochastic approximation techniques. We then propose the Robust Least Squares Policy Iteration (RLSPI) algorithm for learning the optimal robust policy. We also give a general weighted Euclidean norm bound on the error (closeness to optimality) of the resulting policy. Finally, we demonstrate the performance of our RLSPI algorithm on some standard benchmark problems.