Efficient Policy Iteration for Robust Markov Decision Processes via Regularization

Robust Markov decision processes (MDPs) provide a general framework for modeling decision problems in which the system dynamics are changing or only partially known. Efficient methods exist for some \texttt{sa}-rectangular robust MDPs; they exploit the equivalence of these models with reward-regularized MDPs and generalize to online settings. Compared with \texttt{sa}-rectangular robust MDPs, \texttt{s}-rectangular robust MDPs are less restrictive but much harder to handle. Recent work has established an equivalence between \texttt{s}-rectangular robust MDPs and policy-regularized MDPs. However, it remains unclear how to exploit this equivalence to perform policy improvement steps that yield the optimal value function or policy. Little is known about the greedy/optimal policy beyond the fact that it can be stochastic, and no existing method generalizes naturally to model-free settings. We show a clear and explicit equivalence between \texttt{s}-rectangular $L_p$ robust MDPs and policy-regularized MDPs that closely resemble the policy-entropy-regularized MDPs widely used in practice. We then examine the policy improvement step and concretely derive optimal robust Bellman operators for \texttt{s}-rectangular $L_p$ robust MDPs. We find that the greedy/optimal policies in \texttt{s}-rectangular $L_p$ robust MDPs are threshold policies: they play the top $k$ actions whose $Q$-values exceed some threshold (value), each with probability proportional to the $(p-1)$th power of its advantage. In addition, we show that the time complexity of (\texttt{sa}- and \texttt{s}-rectangular) $L_p$ robust MDPs is the same as that of non-robust MDPs up to logarithmic factors. Our work greatly extends the existing understanding of \texttt{s}-rectangular robust MDPs and generalizes naturally to online settings.
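To make the shape of such a threshold policy concrete, the following is a minimal sketch, not the paper's algorithm. It assumes that the threshold is already known and that an action's advantage is its $Q$-value minus that threshold. How the threshold is actually obtained is not stated in this abstract, so it is passed in as an input. The function name `threshold_policy` and the sample numbers are hypothetical.

```python
def threshold_policy(q_values, threshold, p):
    """Illustrative threshold policy over one state's actions.

    Assumptions (not taken from the paper): `threshold` is supplied by the
    caller, and advantage means Q-value minus threshold.
    """
    # Advantage of each action over the threshold, clipped at zero.
    advantages = [max(q - threshold, 0.0) for q in q_values]
    # Only actions strictly above the threshold receive weight. The explicit
    # check keeps p = 1 well defined, since 0.0 ** 0 evaluates to 1.0.
    weights = [a ** (p - 1) if a > 0 else 0.0 for a in advantages]
    total = sum(weights)
    if total == 0:
        raise ValueError("threshold must lie below the largest Q-value")
    # Normalize so the weights form a probability distribution.
    return [w / total for w in weights]


# Hypothetical Q-values for four actions; two of them exceed the threshold.
policy = threshold_policy([1.0, 3.0, 2.0, 0.5], threshold=1.5, p=2)
print(policy)  # [0.0, 0.75, 0.25, 0.0]
```

With $p = 2$ the probabilities are linear in the advantage. With $p = 1$ the exponent is zero, so this sketch spreads probability uniformly over the actions above the threshold.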
