In this paper, we study the non-asymptotic and asymptotic performance of the optimal robust policy and value function of robust Markov Decision Processes (MDPs), where the optimal robust policy and value function are estimated using only a generative model. While prior work on the non-asymptotic performance of robust MDPs is restricted to the setting of the KL uncertainty set under the $(s,a)$-rectangular assumption, we improve those results and also consider other uncertainty sets, including $L_1$ and $\chi^2$ balls. Our results show that when the uncertainty sets are $(s,a)$-rectangular, the sample complexity is about $\widetilde{O}\left(\frac{|\mathcal{S}|^2|\mathcal{A}|}{\varepsilon^2\rho^2(1-\gamma)^4}\right)$. In addition, we extend our results from the $(s,a)$-rectangular assumption to the $s$-rectangular assumption. In this scenario, the sample complexity varies with the choice of uncertainty set and is generally larger than in the $(s,a)$-rectangular case. Moreover, we show, from both theoretical and empirical perspectives, that the optimal robust value function is asymptotically normal at the typical rate $\sqrt{n}$ under both the $(s,a)$- and $s$-rectangular assumptions.
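For concreteness, a standard formulation (the notation here is ours and is not spelled out in the abstract) defines the optimal robust value function as the fixed point of the robust Bellman equation. Under the $(s,a)$-rectangular assumption, with $\widehat{P}_{s,a}$ the transition distribution estimated from the generative model and $\rho$ the radius of the uncertainty set $\mathcal{P}_{s,a}$, it reads
\[
V^*(s) \;=\; \max_{a \in \mathcal{A}} \Big[\, r(s,a) \;+\; \gamma \inf_{P \in \mathcal{P}_{s,a}} \sum_{s' \in \mathcal{S}} P(s')\, V^*(s') \,\Big],
\qquad
\mathcal{P}_{s,a} \;=\; \big\{ P \in \Delta(\mathcal{S}) : D\big(P \,\|\, \widehat{P}_{s,a}\big) \le \rho \big\},
\]
where $D$ measures the discrepancy defining the uncertainty set (e.g., the KL divergence, $L_1$ distance, or $\chi^2$ divergence); the $s$-rectangular case instead constrains the transition kernel jointly across all actions at each state.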