A long-standing challenge in artificial intelligence is lifelong learning. In lifelong learning, many tasks are presented in sequence, and learners must efficiently transfer knowledge between tasks while avoiding catastrophic forgetting over long lifetimes. On these problems, policy reuse and other multi-policy reinforcement learning techniques can learn many tasks. However, they can generate many temporary or permanent policies, resulting in memory issues. Consequently, there is a need for lifetime-scalable methods that continually refine a policy library of a pre-defined size. This paper presents a first approach to lifetime-scalable policy reuse. To pre-select the number of policies, the notion of task capacity, defined as the maximal number of tasks that a policy can accurately solve, is proposed. To evaluate lifetime policy reuse using this method, two state-of-the-art single-actor base-learners are compared: 1) a value-based reinforcement learner, Deep Q-Network (DQN) or Deep Recurrent Q-Network (DRQN); and 2) an actor-critic reinforcement learner, Proximal Policy Optimisation (PPO) with or without a Long Short-Term Memory layer. By selecting the number of policies based on task capacity, D(R)QN achieves near-optimal performance with 6 policies in a 27-task MDP domain and 9 policies in an 18-task POMDP domain; with fewer policies, catastrophic forgetting and negative transfer are observed. Due to slow, monotonic improvement, PPO requires fewer policies, 1 policy for the 27-task domain and 4 policies for the 18-task domain, but it learns the tasks with lower accuracy than D(R)QN. These findings validate lifetime-scalable policy reuse and suggest using D(R)QN for larger and PPO for smaller library sizes.