Bayesian policy reuse (BPR) is a general policy transfer framework that selects a source policy from an offline library by inferring the task belief from observation signals with a trained observation model. In this paper, we propose an improved BPR method that achieves more efficient policy transfer in deep reinforcement learning (DRL). First, most BPR algorithms use the episodic return as the observation signal, which contains limited information and cannot be obtained until the end of an episode. Instead, we employ the state transition sample, which is informative and instantaneous, as the observation signal for faster and more accurate task inference. Second, BPR algorithms usually require numerous samples to estimate the probability distribution of a tabular observation model, which may be expensive and even infeasible to learn and maintain, especially when using state transition samples as the signal. Hence, we propose a scalable observation model based on fitting the state transition functions of source tasks from only a small number of samples, which can generalize to any signal observed in the target task. Moreover, we extend offline-mode BPR to the continual learning setting by expanding the scalable observation model in a plug-and-play fashion, which avoids negative transfer when faced with new, unknown tasks. Experimental results show that our method consistently facilitates faster and more efficient policy transfer.
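To make the belief-update mechanism concrete, the following is a minimal sketch of a BPR-style Bayesian update driven by state-transition signals. It assumes each source task's transition function has already been fitted as a deterministic predictor `f_tau(s, a)`, and it models the observation likelihood as an isotropic Gaussian around the predicted next state; the function names, the Gaussian form, and the noise scale `sigma` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def transition_likelihood(f_tau, s, a, s_next, sigma=0.1):
    """P(s' | s, a, tau): Gaussian density around the fitted prediction.

    Assumes f_tau is a fitted transition model of one source task; sigma is a
    hypothetical observation-noise scale.
    """
    pred = f_tau(s, a)                       # predicted next state under task tau
    err = np.sum((s_next - pred) ** 2)
    d = s_next.shape[0]
    return np.exp(-err / (2 * sigma ** 2)) / ((2 * np.pi * sigma ** 2) ** (d / 2))

def update_belief(belief, models, s, a, s_next):
    """One Bayesian belief update from a single state-transition signal."""
    likelihoods = np.array([transition_likelihood(f, s, a, s_next) for f in models])
    posterior = belief * likelihoods
    total = posterior.sum()
    if total == 0:                           # guard against numerical underflow
        return belief
    return posterior / total

# Usage sketch: start from a uniform prior over the source-task library and
# reuse the source policy whose task has the highest posterior belief.
# models   = [f_tau_1, f_tau_2, ...]   # fitted transition functions (assumed given)
# policies = [pi_1, pi_2, ...]         # corresponding source policies
# belief = np.ones(len(models)) / len(models)
# belief = update_belief(belief, models, s, a, s_next)
# policy = policies[int(np.argmax(belief))]
```

Because the signal is a single transition rather than an episodic return, the belief can be refined at every environment step, which is what enables the faster task inference claimed above.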