Policy gradient (PG) gives rise to a rich class of reinforcement learning (RL) methods. Recently, there has been an emerging trend to accelerate existing PG methods such as REINFORCE via \emph{variance reduction} techniques. However, all existing variance-reduced PG methods rely heavily on an uncheckable importance-weight assumption that must hold at every iteration of the algorithm. In this paper, we propose a simple gradient truncation mechanism to address this issue. Moreover, we design a Truncated Stochastic Incremental Variance-Reduced Policy Gradient (TSIVR-PG) method, which is able to maximize not only a cumulative sum of rewards but also a general utility function of a policy's long-term visitation distribution. We show an $\tilde{\mathcal{O}}(\epsilon^{-3})$ sample complexity for TSIVR-PG to find an $\epsilon$-stationary policy. By assuming overparameterization of the policy and exploiting the hidden convexity of the problem, we further show that TSIVR-PG converges to a globally $\epsilon$-optimal policy with $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples.
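For intuition, the following is a minimal sketch of the kind of update the truncation targets; the exact TSIVR-PG estimator and step rule are specified in the body of the paper, and the symbols here ($g$ for a single-trajectory REINFORCE-style gradient estimate, $\omega$ for the trajectory importance weight between consecutive policies, $B$ for the mini-batch size, $\eta$ for the step size, and $\delta$ for the truncation radius) are illustrative assumptions rather than the paper's notation:
\[
v_t \;=\; v_{t-1} \;+\; \frac{1}{B}\sum_{i=1}^{B}\Big(g\big(\tau^{(i)}_t;\theta_t\big) \;-\; \omega\big(\tau^{(i)}_t;\theta_{t-1},\theta_t\big)\, g\big(\tau^{(i)}_t;\theta_{t-1}\big)\Big),
\qquad
\theta_{t+1} \;=\; \theta_t \;+\; \min\!\Big\{1,\;\tfrac{\delta}{\eta\|v_t\|}\Big\}\,\eta\, v_t .
\]
Truncating the per-iteration movement of $\theta$ to a ball of radius $\delta$ keeps consecutive policies close, which is what allows the importance weights $\omega$ to remain bounded without an additional per-iteration assumption.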