Non-stationarity can arise in Reinforcement Learning (RL) even in stationary environments. For example, most RL algorithms collect new data throughout training, using a non-stationary behaviour policy. Because this non-stationarity is transient, it is often not explicitly addressed in deep RL, and a single neural network is simply updated continually. However, we find evidence that neural networks exhibit a memory effect, where these transient non-stationarities can permanently impact the latent representation and adversely affect generalisation performance. Consequently, to improve the generalisation of deep RL agents, we propose Iterated Relearning (ITER). ITER augments standard RL training by repeatedly transferring the knowledge of the current policy into a freshly initialised network, which thereby experiences less non-stationarity during training. Experimentally, we show that ITER improves performance on the challenging generalisation benchmarks ProcGen and Multiroom.
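A minimal sketch of the core idea described above: the current (teacher) policy is periodically distilled into a freshly initialised (student) network, which then takes over training. This is only an illustration under assumed details; the network architecture, hyper-parameters, and the stand-in observation batch are hypothetical and not the authors' implementation.

```python
# Sketch of periodic relearning: distil the current policy into a fresh network.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_policy(obs_dim: int = 8, n_actions: int = 4) -> nn.Module:
    # Hypothetical small policy network; the actual architecture is not specified here.
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

def distil(teacher: nn.Module, student: nn.Module, obs: torch.Tensor,
           steps: int = 100, lr: float = 1e-3) -> nn.Module:
    """Train the student to match the teacher's action distribution on a batch of observations."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(steps):
        with torch.no_grad():
            target = F.log_softmax(teacher(obs), dim=-1)   # teacher policy (fixed)
        pred = F.log_softmax(student(obs), dim=-1)         # student policy
        loss = F.kl_div(pred, target, log_target=True, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student

# Usage: after a phase of standard RL training of `policy`, replace it with a
# distilled fresh copy and continue RL training on the new network.
policy = make_policy()
obs = torch.randn(256, 8)   # stand-in for recently collected observations
policy = distil(teacher=policy, student=make_policy(), obs=obs)
```

The design intent is that the fresh network only ever sees the final, more stationary policy as a distillation target, rather than the full history of non-stationary updates.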