Explicit exploration in the action space has long been considered indispensable for online policy gradient methods to avoid a drastic degradation in sample complexity when solving general reinforcement learning problems over finite state and action spaces. In this paper, we establish for the first time an $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity for online policy gradient methods without incorporating any exploration strategies. The essential development consists of two new on-policy evaluation operators and a novel analysis of the stochastic policy mirror descent (SPMD) method. SPMD with the first evaluation operator, called value-based estimation, is tailored to the Kullback-Leibler divergence. Provided the Markov chains induced on the state space by the generated policies are uniformly mixing with a non-diminishing minimal visitation measure, an $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity is obtained with a linear dependence on the size of the action space. SPMD with the second evaluation operator, namely truncated on-policy Monte Carlo (TOMC), attains an $\tilde{\mathcal{O}}(\mathcal{H}_{\mathcal{D}}/\epsilon^2)$ sample complexity, where $\mathcal{H}_{\mathcal{D}}$ depends only mildly on the effective horizon and the size of the action space for a properly chosen Bregman divergence (e.g., the Tsallis divergence). SPMD with TOMC also exhibits stronger convergence properties in that it controls the optimality gap with high probability rather than merely in expectation. In contrast to methods relying on explicit exploration, these new policy gradient methods can avoid repeatedly committing to potentially high-risk actions while searching for optimal policies.
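For context, the display below is a minimal sketch of the generic policy mirror descent step that SPMD-type methods build on, assuming the standard proximal formulation; the stepsizes $\eta_k$, the estimator $\hat{Q}^{\pi_k}$ produced by the evaluation operators, and all parameter choices are specified in the paper, not here.
\[
% Hedged sketch of a generic (stochastic) policy mirror descent update.
% Assumptions (not taken from the abstract): eta_k is a stepsize, \hat{Q}^{\pi_k} is an
% estimate of the state-action value function of the current policy \pi_k, and
% D(\cdot,\cdot) is the chosen Bregman divergence (e.g., Kullback-Leibler or Tsallis).
\pi_{k+1}(\cdot \mid s) \in \operatorname*{arg\,min}_{p \in \Delta_{\mathcal{A}}}
\Big\{ \eta_k \big\langle \hat{Q}^{\pi_k}(s,\cdot),\, p \big\rangle
+ D\big(p, \pi_k(\cdot \mid s)\big) \Big\},
\qquad \forall s \in \mathcal{S}.
\]
The two on-policy evaluation operators of the paper correspond to different ways of constructing $\hat{Q}^{\pi_k}$ from on-policy samples, while the choice of $D$ (Kullback-Leibler versus Tsallis) governs the resulting sample-complexity bounds.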