Pre-training with offline data and online fine-tuning using reinforcement learning is a promising strategy for learning control policies by leveraging the best of both worlds in terms of sample efficiency and performance. One natural approach is to initialize the policy for online learning with the one trained offline. In this work, we introduce a policy expansion scheme for this task. After learning the offline policy, we use it as one candidate policy in a policy set. We then expand the policy set with another policy that will be responsible for further learning. The two policies are composed in an adaptive manner for interacting with the environment. With this approach, the policy previously learned offline is fully retained during online learning, thus mitigating potential issues such as destroying the useful behaviors of the offline policy in the initial stage of online learning, while allowing the offline policy to participate in exploration naturally and adaptively. Moreover, new useful behaviors can potentially be captured by the newly added policy through learning. Experiments are conducted on a number of tasks and the results demonstrate the effectiveness of the proposed approach.
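The following is a minimal sketch of the policy expansion idea described above, written in Python/PyTorch and assuming a discrete-action setting. The class and argument names (PolicySet, offline_policy, online_policy, q_network) are illustrative, and the adaptive composition mechanism is not specified in this abstract, so selecting between the two candidate actions with a critic is an assumption made for illustration only.

```python
import torch
import torch.nn as nn


class PolicySet(nn.Module):
    """Holds a frozen offline policy plus a new, learnable online policy."""

    def __init__(self, offline_policy: nn.Module, online_policy: nn.Module,
                 q_network: nn.Module):
        super().__init__()
        self.offline_policy = offline_policy  # fully retained, not updated online
        self.online_policy = online_policy    # responsible for further learning
        self.q_network = q_network            # used to compose the two adaptively (assumed)

        # Freeze the offline policy so online updates cannot destroy its behaviors.
        for p in self.offline_policy.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def act(self, obs: torch.Tensor) -> torch.Tensor:
        """Per state, take whichever candidate action the critic currently prefers."""
        a_off = self.offline_policy(obs).argmax(dim=-1)  # offline candidate action
        a_on = self.online_policy(obs).argmax(dim=-1)    # online candidate action
        q = self.q_network(obs)                          # shape: [batch, num_actions]
        q_off = q.gather(-1, a_off.unsqueeze(-1)).squeeze(-1)
        q_on = q.gather(-1, a_on.unsqueeze(-1)).squeeze(-1)
        return torch.where(q_on >= q_off, a_on, a_off)
```

In this sketch, only online_policy and q_network receive gradient updates during online fine-tuning, while the frozen offline policy still contributes actions whenever the critic rates its proposal higher, which mirrors the adaptive participation of the offline policy in exploration.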