Despite the empirical success of meta reinforcement learning (meta-RL), there are still a number of poorly understood discrepancies between theory and practice. Critically, biased gradient estimates are almost always implemented in practice, whereas prior theory on meta-RL establishes convergence only under unbiased gradient estimates. In this work, we investigate this discrepancy. In particular, (1) we show that unbiased gradient estimates have variance $\Theta(N)$, which depends linearly on the sample size $N$ of the inner-loop updates; (2) we propose linearized score function (LSF) gradient estimates, which have bias $\mathcal{O}(1/\sqrt{N})$ and variance $\mathcal{O}(1/N)$; (3) we show that most prior empirical work in fact implements variants of the LSF gradient estimates, which implies that practical algorithms "accidentally" introduce bias to achieve better performance; (4) we establish theoretical guarantees for the LSF gradient estimates in meta-RL regarding their convergence to stationary points, showing a better dependency on $N$ than prior work when $N$ is large.
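To make the variance scaling in claim (1) concrete, the following is a minimal numerical sketch, not the paper's meta-RL construction: a toy Gaussian "policy" with $N$ i.i.d. inner-loop samples, where the full score-function gradient estimator (sum of all $N$ scores times the return) has variance growing linearly in $N$, while a per-sample decomposition, loosely in the spirit of variance-reduced or linearized estimators, has variance decaying as $\mathcal{O}(1/N)$. The objective, the distributions, and the `g_per_sample` variant are assumptions chosen for illustration only and are not the LSF estimator defined in the paper.

```python
# Toy Monte-Carlo check of the Theta(N) variance claim for the full
# score-function estimator. Illustrative sketch only: the Gaussian "policy",
# the objective R = mean(x), and the per-sample variant below are
# hypothetical stand-ins, not the paper's meta-RL setup or its LSF estimator.
import numpy as np

rng = np.random.default_rng(0)
theta = 1.0      # policy parameter; samples x_i ~ N(theta, 1)
trials = 20000   # Monte-Carlo repetitions per value of N

for N in (10, 100, 1000):
    x = rng.normal(theta, 1.0, size=(trials, N))
    score = x - theta        # per-sample score: d/dtheta log p_theta(x_i)
    R = x.mean(axis=1)       # "return" computed from the N inner-loop samples

    # Unbiased full score-function estimator: (sum of all scores) * return.
    # Its variance grows linearly in N (roughly theta^2 * N + const here).
    g_full = score.sum(axis=1) * R

    # Per-sample decomposition: each score paired only with its own sample.
    # In this toy it stays unbiased and its variance shrinks as O(1/N).
    g_per_sample = (score * x).mean(axis=1)

    print(f"N={N:5d}  mean(full)={g_full.mean():+.3f}  var(full)={g_full.var():9.2f}  "
          f"mean(per-sample)={g_per_sample.mean():+.3f}  var(per-sample)={g_per_sample.var():.4f}")
```

Both estimators target the true gradient $\mathrm{d}J/\mathrm{d}\theta = 1$; the printed empirical variances show the $\Theta(N)$ versus $\mathcal{O}(1/N)$ scaling contrasted in the abstract.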