The powerful learning ability of deep neural networks enables reinforcement learning (RL) agents to learn competent control policies directly from high-dimensional and continuous environments. In theory, to achieve stable performance, neural networks assume i.i.d. inputs, an assumption that unfortunately does not hold in the general RL paradigm, where the training data is temporally correlated and non-stationary. This issue may lead to the phenomenon of "catastrophic interference" and a collapse in performance, as later training is likely to overwrite and interfere with previously learned policies. In this paper, we introduce the concept of "context" into single-task RL and develop a novel scheme, termed Context Division and Knowledge Distillation (CDaKD) driven RL, that divides all states experienced during training into a series of contexts. Its motivation is to mitigate the aforementioned catastrophic interference in deep RL, thereby improving the stability and plasticity of RL models. At the heart of CDaKD is a value function, parameterized by a neural network feature extractor shared across all contexts, together with a set of output heads, each specializing in an individual context. In CDaKD, we exploit online clustering to achieve context division, and interference is further alleviated by a knowledge distillation regularization term on the output layers of learned contexts. In addition, to obtain an effective context division in high-dimensional state spaces (e.g., image inputs), we perform clustering in the lower-dimensional representation space of a randomly initialized convolutional encoder, which is kept fixed throughout training. Our results show that, across various replay memory capacities, CDaKD can consistently improve the performance of existing RL algorithms on classic OpenAI Gym tasks and the more complex high-dimensional Atari tasks, incurring only moderate computational overhead.
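To make the three ingredients named above concrete (a shared feature extractor with per-context output heads, context assignment via clustering in the representation space of a fixed random encoder, and a distillation penalty on heads of previously learned contexts), the following is a minimal PyTorch sketch. It is not the authors' implementation: all names (CDaKDValueNet, assign_context, distillation_loss), the choice of an L2 penalty between current and snapshot head outputs, and all architectures and hyperparameters are illustrative assumptions based on the description above.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class CDaKDValueNet(nn.Module):
    """Value function with a feature extractor shared across all contexts
    and one output head per context."""

    def __init__(self, state_dim, action_dim, num_contexts, hidden_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(           # shared across contexts
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.heads = nn.ModuleList(              # one Q-head per context
            nn.Linear(hidden_dim, action_dim) for _ in range(num_contexts)
        )

    def forward(self, states, context_id):
        return self.heads[context_id](self.backbone(states))


def assign_context(states, centroids, encoder):
    """Context division: embed states with a fixed, randomly initialized
    encoder (a conv net for image inputs) and assign each state to the
    nearest centroid maintained by an online clustering routine."""
    with torch.no_grad():
        z = encoder(states)                      # (batch, latent_dim)
    return torch.cdist(z, centroids).argmin(dim=1)


def distillation_loss(net, snapshot, states, old_contexts):
    """One hedged reading of the regularization term: keep the output
    heads of previously learned contexts close to a frozen snapshot."""
    loss = torch.zeros(())
    for c in old_contexts:
        with torch.no_grad():
            target_q = snapshot(states, c)       # teacher outputs
        loss = loss + F.mse_loss(net(states, c), target_q)
    return loss


# Toy usage with made-up dimensions:
net = CDaKDValueNet(state_dim=8, action_dim=4, num_contexts=3)
snapshot = copy.deepcopy(net)   # taken after earlier contexts were learned
states = torch.randn(32, 8)
q_values = net(states, context_id=0)
reg = distillation_loss(net, snapshot, states, old_contexts=[0, 1])

# Stand-in for the fixed conv encoder used for context division:
encoder = nn.Linear(8, 2)
for p in encoder.parameters():
    p.requires_grad_(False)      # randomly initialized and never trained
centroids = torch.randn(3, 2)    # would be updated by online clustering
ctx = assign_context(states, centroids, encoder)
```

The online clustering step itself (e.g., incrementally updating the centroids from mini-batches of encoded states) is omitted here; the sketch only shows how a fixed set of centroids would route states to context-specific heads.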