Sample Complexity of Robust Reinforcement Learning with a Generative Model

The Robust Markov Decision Process (RMDP) framework focuses on designing control policies that are robust against the parameter uncertainties due to the mismatches between the simulator model and real-world settings. An RMDP problem is typically formulated as a max-min problem, where the objective is to find the policy that maximizes the value function for the worst possible model that lies in an uncertainty set around a nominal model. The standard robust dynamic programming approach requires the knowledge of the nominal model for computing the optimal robust policy. In this work, we propose a model-based reinforcement learning (RL) algorithm for learning an $\epsilon$-optimal robust policy when the nominal model is unknown. We consider three different forms of uncertainty sets, characterized by the total variation distance, chi-square divergence, and KL divergence. For each of these uncertainty sets, we give a precise characterization of the sample complexity of our proposed algorithm. In addition to the sample complexity results, we also present a formal analytical argument on the benefit of using robust policies. Finally, we demonstrate the performance of our algorithm on two benchmark problems.
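The max-min formulation above can be made concrete with a small sketch. This is not the paper's implementation; it is a minimal illustration under stated assumptions: a tabular problem, a known deterministic reward table `R`, a total variation uncertainty set defined as $\frac{1}{2}\|q - \hat{p}\|_1 \le \rho$ around the empirical nominal model, and a hypothetical callback `sample_next_state` standing in for the generative model. All function and parameter names are illustrative.

```python
import numpy as np

def estimate_nominal(sample_next_state, n_states, n_actions, n_samples, rng):
    """Empirical nominal model from n_samples generative-model draws per (s, a)."""
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                # sample_next_state is a hypothetical simulator callback.
                P_hat[s, a, sample_next_state(s, a, rng)] += 1
    return P_hat / n_samples

def worst_case_tv(p, v, rho):
    """Minimize q @ v over distributions q with 0.5 * ||q - p||_1 <= rho."""
    q = p.astype(float).copy()
    lo = int(np.argmin(v))
    budget = rho
    # Shift up to rho probability mass from the highest-value states
    # onto the lowest-value state.
    for s in np.argsort(v)[::-1]:
        if s == lo or budget <= 0:
            break
        moved = min(q[s], budget)
        q[s] -= moved
        q[lo] += moved
        budget -= moved
    return float(q @ v)

def robust_value_iteration(P_hat, R, gamma, rho, n_iters=200):
    """Robust dynamic programming on the estimated model (max over actions, min over models)."""
    n_states, n_actions, _ = P_hat.shape
    v = np.zeros(n_states)
    for _ in range(n_iters):
        q = np.array([[R[s, a] + gamma * worst_case_tv(P_hat[s, a], v, rho)
                       for a in range(n_actions)]
                      for s in range(n_states)])
        v = q.max(axis=1)
    return v, q.argmax(axis=1)
```

Setting `rho = 0` recovers ordinary value iteration on the empirical model, so the gap between the two value vectors gives a rough sense of how much worst-case performance the uncertainty set costs. The chi-square and KL uncertainty sets would need a different inner minimization, which this sketch does not cover.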
