In reinforcement learning, continuous time is often discretized by a time scale $\delta$, to which the resulting performance is known to be highly sensitive. In this work, we seek a $\delta$-invariant algorithm for policy gradient (PG) methods, one that performs well regardless of the value of $\delta$. We first identify the underlying reasons that cause PG methods to fail as $\delta \to 0$, proving that the variance of the PG estimator can diverge to infinity in stochastic environments under a certain assumption of stochasticity. While durative actions or action repetition can be employed to achieve $\delta$-invariance, previous action repetition methods cannot immediately react to unexpected situations in stochastic environments. We thus propose a novel $\delta$-invariant method named Safe Action Repetition (SAR), applicable to any existing PG algorithm. SAR can handle the stochasticity of environments by adaptively reacting to changes in states during action repetition. We empirically show that our method is not only $\delta$-invariant but also robust to stochasticity, outperforming previous $\delta$-invariant approaches on eight MuJoCo environments in both deterministic and stochastic settings. Our code is available at https://vision.snu.ac.kr/projects/sar.
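To make the core idea concrete, the following is a minimal sketch (not the paper's implementation) of a state-deviation-based action repetition step: an action is repeated at the fine time scale $\delta$ only while the state stays within a distance $\epsilon$ of the state where the action was chosen, so the agent can re-decide as soon as something unexpected happens. The `ToyEnv` class, the `sar_step` helper, and all parameter names here are hypothetical illustrations, not the authors' code.

```python
import numpy as np

class ToyEnv:
    """Hypothetical 1-D environment: each fine-grained step of length
    delta moves the state by action * delta plus small noise (a stand-in
    for a real stochastic simulator such as a MuJoCo task)."""
    def __init__(self, delta=0.01, seed=0):
        self.delta = delta
        self.s = np.zeros(1)
        self.rng = np.random.default_rng(seed)

    def step(self, action):
        self.s = self.s + action * self.delta + 0.001 * self.rng.normal(size=1)
        reward = -float(np.abs(self.s))  # reward for staying near the origin
        return self.s.copy(), reward, False

def sar_step(env, action, s0, epsilon, max_repeats=1000):
    """One macro-step in the spirit of SAR: repeat `action` at the fine
    time scale until the state deviates from s0 (the state where the
    action was selected) by more than epsilon, then hand control back
    to the policy. Returns the new state, summed reward, and the number
    of fine-grained steps taken."""
    total = 0.0
    for k in range(1, max_repeats + 1):
        s, r, done = env.step(action)
        total += r
        if done or np.linalg.norm(s - s0) > epsilon:
            break
    return s, total, k
```

Because the stopping rule is defined in state space rather than in time, the number of repetitions automatically shrinks as $\delta \to 0$ grows finer, which is what makes this style of repetition $\delta$-invariant while still reacting immediately to stochastic perturbations.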