We present a novel robust policy gradient method (RPG) for s-rectangular robust Markov decision processes (MDPs). We are the first to derive the adversarial kernel in closed form, and we show that it is a rank-one perturbation of the nominal kernel. This allows us to derive an RPG of the same form as the policy gradient for non-robust MDPs, except with a robust Q-value function and an additional correction term. Both the robust Q-values and the correction term are efficiently computable, so the time complexity of our method matches that of non-robust MDPs, which is significantly faster than existing black-box methods.
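The structural claim above (adversarial kernel = nominal kernel + a rank-one term) can be illustrated with a minimal numerical sketch. This is not the paper's construction of the adversary; the vectors `u` and `v` below are arbitrary placeholders chosen only so the perturbed matrix remains row-stochastic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 4

# Nominal transition kernel: each row is a probability distribution
# over next states.
P_nom = rng.random((n_states, n_states))
P_nom /= P_nom.sum(axis=1, keepdims=True)

# A rank-one perturbation u v^T. Choosing v with zero sum guarantees each
# row of P_adv still sums to one (a necessary condition for P_adv to be
# a transition kernel); the specific values here are illustrative only.
u = np.array([0.05, -0.02, 0.01, 0.03])
v = np.array([0.5, -0.25, -0.25, 0.0])  # entries sum to zero
P_adv = P_nom + np.outer(u, v)

assert np.allclose(P_adv.sum(axis=1), 1.0)        # rows still sum to 1
assert np.linalg.matrix_rank(P_adv - P_nom) == 1  # perturbation is rank one
```

The point of the sketch: because the adversarial kernel differs from the nominal one only by a rank-one term, quantities defined against it (such as robust Q-values) can be updated from their nominal counterparts cheaply, rather than by solving a black-box optimization over the uncertainty set.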