We consider Model-Agnostic Meta-Learning (MAML) methods for Reinforcement Learning (RL) problems, where the goal is to find a policy, using data from several tasks represented by Markov Decision Processes (MDPs), that can be updated by one step of stochastic policy gradient for the realized MDP. In particular, using stochastic gradients in the MAML update step is crucial for RL problems, since computing exact gradients would require access to a large number of possible trajectories. For this formulation, we propose a variant of the MAML method, named Stochastic Gradient Meta-Reinforcement Learning (SG-MRL), and study its convergence properties. We derive the iteration and sample complexity of SG-MRL to find an $\epsilon$-first-order stationary point, which, to the best of our knowledge, provides the first convergence guarantee for model-agnostic meta-reinforcement learning algorithms. We further show how our results extend to the case where more than one step of stochastic policy gradient is used in the update at test time.
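For concreteness, the one-step meta-RL objective described above can be sketched as follows; this is an illustrative formulation with assumed generic notation ($J_i$ for the expected return of task $i$, $\alpha$ for the adaptation step size, $\tilde{\nabla} J_i$ for a stochastic policy-gradient estimate), which may differ from the paper's exact symbols.

% Illustrative sketch (assumed notation): maximize the expected post-adaptation
% return when each sampled task i is adapted by one stochastic policy-gradient step.
\begin{equation*}
  \max_{\theta}\; \mathbb{E}_{i \sim p(\mathcal{T})}
  \Big[\, J_i\big(\theta + \alpha\, \tilde{\nabla} J_i(\theta)\big) \,\Big],
\end{equation*}
% where J_i(\theta) denotes the expected return of policy \pi_\theta on MDP i,
% \alpha > 0 is the adaptation step size, and \tilde{\nabla} J_i(\theta) is a
% stochastic policy-gradient estimate computed from sampled trajectories.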