We present new policy mirror descent (PMD) methods for solving reinforcement learning (RL) problems with either strongly convex or general convex regularizers. By exploring the structural properties of these seemingly highly nonconvex problems, we show that the PMD methods exhibit a fast linear rate of convergence to the global optimum. We develop stochastic counterparts of these methods and establish an ${\cal O}(1/\epsilon)$ (resp., ${\cal O}(1/\epsilon^2)$) sampling complexity for solving these RL problems with strongly (resp., general) convex regularizers using different sampling schemes, where $\epsilon$ denotes the target accuracy. We further show that the complexity for computing the gradients of these regularizers, if necessary, can be bounded by ${\cal O}\{(\log_\gamma \epsilon) [(1-\gamma)L/\mu]^{1/2}\log (1/\epsilon)\}$ (resp., ${\cal O} \{(\log_\gamma \epsilon ) (L/\epsilon)^{1/2}\}$) for problems with strongly (resp., general) convex regularizers. Here $\gamma$ denotes the discount factor. To the best of our knowledge, these complexity bounds, along with our algorithmic developments, appear to be new in both the optimization and RL literature. The introduction of these convex regularizers also greatly expands the flexibility and applicability of RL models.
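To fix ideas, a generic PMD step can be sketched as follows (an illustrative form only; the symbols $Q^{\pi_k}$, $h$, $D$, and $\eta_k$ below are generic placeholders rather than the paper's exact notation): at iteration $k$, for each state $s$,
\[
\pi_{k+1}(\cdot\,|\,s) \in \arg\min_{p \in \Delta_{|\mathcal{A}|}} \Big\{ \eta_k \big[ \langle Q^{\pi_k}(s,\cdot),\, p \rangle + h(p) \big] + D\big(p,\, \pi_k(\cdot\,|\,s)\big) \Big\},
\]
where $Q^{\pi_k}$ is the action-value function of the current policy $\pi_k$, $h$ is the (strongly or general) convex regularizer, $D$ is a Bregman divergence (e.g., the Kullback--Leibler divergence), and $\eta_k$ is a stepsize.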