Reinforcement learning (RL) experiments have notoriously high variance, and minor details can have disproportionately large effects on measured outcomes. This hampers reproducible research and is an obstacle to real-world applications, where safety and predictability are paramount. In this paper, we investigate causes of this perceived instability. To allow for an in-depth analysis, we focus on a particularly popular setup with high variance -- continuous control from pixels with an actor-critic agent. In this setting, we demonstrate that variance mostly arises early in training as a result of poor "outlier" runs, but that weight initialization and initial exploration are not to blame. We show that one cause of early variance is numerical instability, which leads to saturating nonlinearities. We investigate several fixes to this issue and find that one particular method is surprisingly effective and simple -- normalizing penultimate features. Addressing the learning instability allows for larger learning rates and significantly decreases the variance of outcomes. This demonstrates that the perceived variance in RL is not necessarily inherent to the problem definition and may be addressed through simple architectural modifications.
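To make the proposed fix concrete, below is a minimal sketch of what normalizing penultimate features could look like in a critic network, assuming a PyTorch-style implementation; the class name `NormalizedCritic`, the layer sizes, and the use of an L2 projection are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NormalizedCritic(nn.Module):
    """Hypothetical critic head whose penultimate features are projected
    onto the unit hypersphere before the final linear layer."""

    def __init__(self, feature_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        h = self.trunk(features)
        # Normalize penultimate features to unit L2 norm so their scale
        # cannot grow unboundedly and push downstream nonlinearities
        # into saturation, one way to curb the numerical instability
        # described in the abstract.
        h = F.normalize(h, p=2.0, dim=-1)
        return self.head(h)
```

Because the normalization bounds the feature magnitude, the gradient scale seen by the final layer stays controlled, which is consistent with the abstract's claim that the fix permits larger learning rates.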