Policy robustness in Reinforcement Learning (RL) may not be desirable at any price; the alterations that robustness requirements introduce into otherwise optimal policies should be explainable and quantifiable. Policy gradient algorithms with strong convergence guarantees are usually modified to obtain robust policies in ways that do not preserve those guarantees, which defeats the purpose of formal robustness requirements. In this work we study a notion of robustness in partially observable MDPs where state observations are perturbed by a noise-induced stochastic kernel. We characterise the set of policies that are maximally robust by analysing how policies are altered by this kernel. We then establish a connection between such robust policies and certain properties of the noise kernel, as well as structural properties of the underlying MDP, constructing sufficient conditions for policy robustness. We use these notions to propose a robustness-inducing scheme, applicable to any policy gradient algorithm, that formally trades off the reward achieved by a policy against its robustness level through lexicographic optimisation, while preserving the convergence properties of the original algorithm. We test the proposed approach through numerical experiments on safety-critical RL environments, and show how it achieves high robustness when state errors are introduced in the policy roll-out.
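To make the lexicographic trade-off concrete, the following is a minimal sketch, not the paper's algorithm, of how a single parameter update could give the reward objective strict priority over a robustness objective. The function name `lexicographic_step`, the tolerance `tol`, and the conflict-projection rule are illustrative assumptions.

```python
import numpy as np

def lexicographic_step(theta, grad_reward, grad_robust, lr=1e-2, tol=1e-3):
    """One hypothetical lexicographic ascent step: the reward gradient has
    strict priority; the robustness gradient is only followed along
    directions that do not decrease the reward objective."""
    g1 = np.asarray(grad_reward, dtype=float)
    g2 = np.asarray(grad_robust, dtype=float)

    if np.linalg.norm(g1) > tol:
        # Far from a reward-stationary point: ascend on reward, adding only
        # the part of the robustness gradient that does not oppose it
        # (remove any component of g2 pointing against g1).
        g2_safe = g2 - min(0.0, g1 @ g2) / (g1 @ g1) * g1
        direction = g1 + g2_safe
    else:
        # Reward is (locally) optimised: improve robustness freely.
        direction = g2
    return theta + lr * direction

# Illustrative usage with conflicting gradient directions.
theta = np.zeros(2)
theta = lexicographic_step(theta,
                           grad_reward=np.array([1.0, 0.0]),
                           grad_robust=np.array([-1.0, 1.0]))
```

In this sketch the robustness gradient is projected so that its inner product with the reward gradient is non-negative, which is one way to encode the lexicographic priority described above; how the paper implements this trade-off within a specific policy gradient algorithm is detailed in the main text.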