Designing computationally efficient exploration strategies for on-policy first-order methods that attain the optimal $\mathcal{O}(1/\epsilon^2)$ sample complexity remains an open problem for solving Markov decision processes (MDPs). This manuscript provides an answer to this question from the perspective of simplicity, by showing that whenever exploration over the state space is implied by the MDP structure, there seems to be little need for sophisticated exploration strategies. We revisit a stochastic policy gradient method, named stochastic policy mirror descent (SPMD), applied to the infinite-horizon, discounted MDP with finite state and action spaces. Accompanying SPMD, we present two on-policy evaluation operators, both of which simply follow the current policy for trajectory collection, with no explicit exploration or any other form of intervention. SPMD with the first evaluation operator, named value-based estimation, is tailored to the Kullback-Leibler (KL) divergence. Provided the Markov chains induced on the state space by the generated policies are uniformly mixing with a non-diminishing minimal visitation measure, an $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity is obtained, with a linear dependence on the size of the action space. SPMD with the second evaluation operator, named truncated on-policy Monte Carlo, attains an $\tilde{\mathcal{O}}(\mathcal{H}_{\mathcal{D}} / \epsilon^2)$ sample complexity under the same assumption on the state chains induced by the generated policies. We characterize $\mathcal{H}_{\mathcal{D}}$ as a divergence-dependent function of the effective horizon and the size of the action space, which leads to an exponential dependence on the latter two quantities for the KL divergence, and a polynomial dependence for the divergence induced by the negative Tsallis entropy. These sample complexities appear to be new among on-policy stochastic policy gradient methods without explicit exploration.
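For concreteness, the prototypical policy mirror descent update that SPMD stochastically approximates can be sketched as follows; the precise stepsize schedule, choice of divergence $D$, and $Q$-function estimator are as specified in the body of the paper, and the notation below is assumed rather than taken from the abstract:
\[
\pi_{k+1}(\cdot \mid s) \in \operatorname*{argmin}_{p \in \Delta_{\mathcal{A}}} \Big\{ \eta_k \big\langle \widehat{Q}^{\pi_k}(s, \cdot),\, p \big\rangle + D\big(p, \pi_k(\cdot \mid s)\big) \Big\}, \quad \forall s \in \mathcal{S},
\]
where $\widehat{Q}^{\pi_k}$ denotes the on-policy estimate of the state-action value function returned by one of the two evaluation operators, $\eta_k$ is a stepsize, and $D$ is the Bregman divergence (e.g., the KL divergence or the divergence induced by the negative Tsallis entropy).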