Many policy gradient methods optimize the objective $\max_{\pi}E_{\pi}[A_{\pi_{old}}(s,a)]$, where $A_{\pi_{old}}$ is the advantage function of the old policy. This objective cannot be optimized directly because no samples from the new policy are available yet. Importance sampling (IS) addresses this by reweighting samples from the old policy, giving the IS-corrected objective, also called the CPI objective, $\max_{\pi}E_{\pi_{old}}[\frac{\pi(s,a)}{\pi_{old}(s,a)}A_{\pi_{old}}(s,a)]$. However, optimizing this objective is still problematic because extremely large IS ratios can cause algorithms to fail catastrophically. PPO therefore uses a surrogate objective and seeks an approximate solution in a clipped policy space, $\Pi_{\epsilon}=\{\pi; |\frac{\pi(s,a)}{\pi_{old}(s,a)}-1|<\epsilon \}$, where $\epsilon$ is a small positive number. One question that drives this paper is: {\em How well grounded is the hypothesis that $\Pi_{\epsilon}$ contains good enough policies?} {\bfseries Do there exist better policies outside of $\mathbf{\Pi_{\epsilon}}$?} Using a novel surrogate objective that employs the sigmoid function, which results in an interesting form of exploration, we found that there do indeed exist much better policies outside of $\Pi_{\epsilon}$; moreover, these policies are located very far from it. We compare with several best-performing algorithms on both discrete and continuous tasks, and the results show that {\em PPO is insufficient in off-policyness}: our new method P3O is {\em more off-policy} than PPO according to the "off-policyness" measured by the {\em DEON off-policy metric}, and P3O {\em \bfseries explores in a much larger policy space} than PPO.
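To make the quantities above concrete, the following is a minimal sketch, not the paper's implementation. It estimates the CPI objective from samples drawn under $\pi_{old}$, evaluates the commonly used clipped PPO surrogate for comparison, and tests whether a policy lies in $\Pi_{\epsilon}$ on those samples. The function names, the toy probabilities, and the choice $\epsilon = 0.2$ are illustrative assumptions, and the clipped form shown is the standard one from the PPO literature rather than something stated in this abstract. P3O's sigmoid-based surrogate is not sketched because its formula is not given here.

```python
# Illustrative sketch only: all names and numbers below are assumptions.

def is_ratios(pi_new, pi_old):
    """Importance sampling ratios pi(s,a) / pi_old(s,a), one per sample."""
    return [p / q for p, q in zip(pi_new, pi_old)]

def cpi_objective(ratios, advantages):
    """Sample mean of ratio * advantage (the IS-corrected / CPI objective)."""
    return sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)

def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Standard clipped PPO surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        total += min(r * a, clipped * a)
    return total / len(ratios)

def in_clipped_space(ratios, eps=0.2):
    """True if |ratio - 1| < eps on every sample, i.e. pi is in Pi_eps on this batch."""
    return all(abs(r - 1.0) < eps for r in ratios)

# Toy batch: action probabilities under the old and new policy, plus advantages.
pi_old = [0.5, 0.25, 0.1]
pi_new = [0.55, 0.5, 0.05]
advantages = [1.0, 2.0, -1.0]

ratios = is_ratios(pi_new, pi_old)
print("ratios:", ratios)
print("CPI objective:", cpi_objective(ratios, advantages))
print("PPO clipped objective:", ppo_clip_objective(ratios, advantages))
print("inside Pi_eps:", in_clipped_space(ratios))
```

In this toy batch the second sample has a ratio of 2.0, so the new policy lies outside $\Pi_{\epsilon}$ for $\epsilon = 0.2$, and the clipped surrogate gives that sample less credit than the unclipped CPI estimate does.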