We are interested in learning models of non-stationary environments, which can be framed as a multi-task learning problem. Model-free reinforcement learning algorithms can achieve good asymptotic performance in multi-task learning, but at the cost of extensive sampling, since they learn each task from scratch. While model-based approaches are among the most data-efficient learning algorithms, they still struggle with complex tasks and model uncertainty. Meta-reinforcement learning addresses the efficiency and generalization challenges of multi-task learning by quickly adapting a meta-learned prior policy to a new task. In this paper, we propose a meta-reinforcement learning approach that learns the dynamic model of a non-stationary environment and later uses it for meta-policy optimization. Owing to the sample efficiency of model-based learning methods, we are able to simultaneously train both the meta-model of the non-stationary environment and the meta-policy until the dynamic model converges. The meta-learned dynamic model of the environment then generates simulated data for meta-policy optimization. Our experiments demonstrate that the proposed method can meta-learn a policy in a non-stationary environment with the data efficiency of model-based approaches while achieving the high asymptotic performance of model-free meta-reinforcement learning.
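The two-phase procedure described above can be sketched schematically as follows. This is a minimal illustration under assumed interfaces; the class and function names (MetaDynamicsModel, MetaPolicy, collect_rollouts, meta_train) are hypothetical placeholders, not the paper's actual implementation.

```python
# Schematic sketch of the two-phase meta-training loop described in the abstract.
# All names and interfaces here are illustrative assumptions.

import random


class MetaDynamicsModel:
    """Stand-in for a meta-learned dynamics model of the non-stationary environment."""

    def meta_update(self, batches):
        # Placeholder: fit the model on real rollouts from all tasks and
        # return the current model loss.
        return random.random()

    def simulate(self, task, policy):
        # Placeholder: generate imagined rollouts for one task under `policy`.
        return [("simulated transition", task)]


class MetaPolicy:
    """Stand-in for the meta-policy being optimized."""

    def meta_update(self, batches):
        # Placeholder: one meta-policy optimization step on the given batches.
        pass


def collect_rollouts(task, policy):
    # Placeholder: interact with the real environment for one task.
    return [("real transition", task)]


def meta_train(tasks, n_iters=100, model_tol=0.05):
    model, policy = MetaDynamicsModel(), MetaPolicy()

    # Phase 1: jointly train the meta-model and the meta-policy on real
    # environment data until the dynamic model converges.
    for _ in range(n_iters):
        real = [collect_rollouts(t, policy) for t in tasks]
        loss = model.meta_update(real)
        policy.meta_update(real)
        if loss < model_tol:
            break

    # Phase 2: the meta-learned dynamic model generates simulated data,
    # and meta-policy optimization continues without further real samples.
    for _ in range(n_iters):
        sim = [model.simulate(t, policy) for t in tasks]
        policy.meta_update(sim)

    return model, policy
```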