Understanding the learning dynamics of the policy is key to demystifying Reinforcement Learning (RL). This is especially important, yet challenging, for Deep RL (DRL), where such understanding could yield remedies for notorious issues such as sample inefficiency and learning instability. In this paper, we study how the policy networks of typical DRL agents evolve during learning by empirically investigating several kinds of temporal change in each policy parameter. On typical MuJoCo and DeepMind Control Suite (DMC) benchmarks, we find two phenomena common to TD3 and RAD agents: 1) the activity of policy network parameters is highly asymmetric, and policy networks advance monotonically along very few major parameter directions; 2) severe detours occur in parameter updates, and harmonic-like changes are observed in all minor parameter directions. By performing a novel temporal SVD along the policy learning path, we identify the major and minor parameter directions as the columns of the right unitary matrix associated with the dominant and insignificant singular values, respectively. Driven by these discoveries, we propose a simple and effective method, called Policy Path Trimming and Boosting (PPTB), as a general plug-in improvement to DRL algorithms. The key idea of PPTB is to periodically trim the policy learning path by canceling the policy updates in minor parameter directions, while boosting the learning path by encouraging advancement in major directions. In experiments, we demonstrate the general and significant performance improvements brought by PPTB when combined with TD3 in MuJoCo environments and with RAD in DMC environments.
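The abstract does not spell out the procedure, so the following is only a minimal sketch of how a temporal SVD and a trim-and-boost step might look. It assumes the learning path is stored as a matrix whose rows are the flattened policy parameters at successive checkpoints. The function names `temporal_svd` and `trim_and_boost`, the parameters `num_major` and `boost`, and the specific rules used here (resetting minor coordinates to their value at the start of the window, extrapolating major coordinates linearly) are illustrative assumptions, not the paper's actual method.

```python
import numpy as np


def temporal_svd(path):
    """SVD of a learning path of shape (T, D): T checkpoints, D parameters.

    Returns U, S, Vt with path = U @ diag(S) @ Vt. The rows of Vt (columns
    of the right unitary matrix V) are parameter directions, ordered from
    the largest singular value to the smallest.
    """
    return np.linalg.svd(path, full_matrices=False)


def trim_and_boost(path, num_major, boost=0.0):
    """Return an adjusted version of the latest parameters (hypothetical rule).

    Assumption: "trimming" resets the coordinates along minor directions to
    their value at the first checkpoint of the window, and "boosting" pushes
    the coordinates along major directions further in the direction they
    have moved over the window.
    """
    _, _, vt = temporal_svd(path)
    coords = path @ vt.T                 # path expressed in the SVD basis
    first, last = coords[0], coords[-1]
    new = last.copy()
    new[num_major:] = first[num_major:]  # trim: cancel minor-direction updates
    new[:num_major] += boost * (last[:num_major] - first[:num_major])  # boost
    return new @ vt                      # map back to parameter space


# Toy learning path: 6 checkpoints of a 4-parameter "policy" (synthetic data).
rng = np.random.default_rng(0)
path = np.cumsum(rng.normal(size=(6, 4)), axis=0)
U, S, Vt = temporal_svd(path)
adjusted = trim_and_boost(path, num_major=1, boost=0.1)
```

In a real agent, the adjusted vector would be written back into the policy network periodically. How many directions count as major, and how strongly to boost, would need tuning and are not specified in the abstract.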