We consider Model-Agnostic Meta-Learning (MAML) methods for Reinforcement Learning (RL) problems, where the goal is to use data from several tasks, each represented by a Markov Decision Process (MDP), to find a policy that can be updated by one step of stochastic policy gradient for the realized MDP. Using stochastic gradients in the MAML update steps is particularly crucial for RL problems, since computing exact gradients requires access to a large number of possible trajectories. For this formulation, we propose a variant of the MAML method, named Stochastic Gradient Meta-Reinforcement Learning (SG-MRL), and study its convergence properties. We derive the iteration and sample complexity of SG-MRL for finding an $\epsilon$-first-order stationary point, which, to the best of our knowledge, provides the first convergence guarantee for model-agnostic meta-reinforcement learning algorithms. We further show how our results extend to the case where more than one step of the stochastic policy gradient method is used at test time. Finally, we empirically compare SG-MRL and MAML in several deep RL environments.
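For concreteness, the following is a minimal sketch of the meta-objective behind this formulation; the notation here ($p(\mathcal{T})$, $J_i$, $\alpha$) is illustrative and not taken verbatim from the paper. For a distribution $p(\mathcal{T})$ over tasks (MDPs), expected return $J_i(\theta)$ of policy $\pi_\theta$ on task $i$, and adaptation step size $\alpha > 0$, a MAML-type meta-RL method seeks
\[
\max_{\theta}\; \mathbb{E}_{i \sim p(\mathcal{T})}\Big[\, J_i\big(\theta + \alpha \nabla_\theta J_i(\theta)\big) \Big],
\]
i.e., a policy parameter $\theta$ that performs well after one policy-gradient adaptation step on the realized task. In the stochastic-gradient setting considered here, both the inner adaptation gradient $\nabla_\theta J_i(\theta)$ and the gradient of the outer objective are replaced by stochastic policy-gradient estimates computed from sampled trajectories, since the exact gradients are intractable for realistic MDPs.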