Time Discretization-Invariant Safe Action Repetition for Policy Gradient Methods

In reinforcement learning, continuous time is often discretized by a time scale $\delta$, to which the resulting performance is known to be highly sensitive. In this work, we seek a $\delta$-invariant algorithm for policy gradient (PG) methods, one that performs well regardless of the value of $\delta$. We first identify the underlying reasons that cause PG methods to fail as $\delta \to 0$, proving that the variance of the PG estimator can diverge to infinity in stochastic environments under a certain assumption of stochasticity. While durative actions or action repetition can be employed to achieve $\delta$-invariance, previous action repetition methods cannot immediately react to unexpected situations in stochastic environments. We thus propose a novel $\delta$-invariant method named Safe Action Repetition (SAR), applicable to any existing PG algorithm. SAR handles the stochasticity of environments by adaptively reacting to changes in states during action repetition. We empirically show that our method is not only $\delta$-invariant but also robust to stochasticity, outperforming previous $\delta$-invariant approaches on eight MuJoCo environments in both deterministic and stochastic settings. Our code is available at https://vision.snu.ac.kr/projects/sar.
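The idea of reacting to state changes during action repetition can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes one possible reading, in which an action is repeated until the state leaves a ball of a chosen radius around the state where the repetition began. The environment `ToyPointEnv`, the function `repeat_while_safe`, and the parameters `radius`, `noise_std`, and `max_steps` are all hypothetical names introduced for illustration only.

```python
import math
import random


class ToyPointEnv:
    """Hypothetical 1-D environment: the state moves by action * delta plus noise."""

    def __init__(self, delta, noise_std=0.0, seed=0):
        self.delta = delta          # time discretization scale
        self.noise_std = noise_std  # assumed noise level (0.0 = deterministic)
        self.rng = random.Random(seed)
        self.state = 0.0

    def step(self, action):
        # Noise scaled by sqrt(delta), an assumption mimicking a diffusion term.
        noise = self.rng.gauss(0.0, self.noise_std) * math.sqrt(self.delta)
        self.state += action * self.delta + noise
        # Reward is scaled by the step length so returns are comparable across delta.
        reward = -abs(self.state - 1.0) * self.delta
        return self.state, reward


def repeat_while_safe(env, action, radius, max_steps=10_000):
    """Repeat `action` until the state leaves the ball of `radius` around the start state."""
    start = env.state
    total_reward, steps = 0.0, 0
    while steps < max_steps:
        state, reward = env.step(action)
        total_reward += reward
        steps += 1
        # Stop as soon as the state has drifted too far, expected or not.
        if abs(state - start) > radius:
            break
    return state, total_reward, steps


if __name__ == "__main__":
    for delta in (0.1, 0.01, 0.001):
        env = ToyPointEnv(delta=delta)
        _, _, steps = repeat_while_safe(env, action=1.0, radius=0.5)
        print(f"delta={delta}: {steps} steps, elapsed time {steps * delta:.3f}")
```

In this toy setting the stopping rule depends on the distance travelled in state space rather than on a fixed number of repeats, so the elapsed physical time of one repetition stays roughly the same as $\delta$ shrinks. With `noise_std > 0`, an unexpectedly large state change also ends the repetition early, which a fixed repeat count would not do.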

