Meta-gradients provide a general approach for optimizing the meta-parameters of reinforcement learning (RL) algorithms. Estimation of meta-gradients is central to the performance of these meta-algorithms, and has been studied in the setting of MAML-style short-horizon meta-RL problems. In this context, prior work has investigated the estimation of the Hessian of the RL objective, as well as the problem of credit assignment to pre-adaptation behavior, which is addressed with a sampling correction. However, we show that Hessian estimation, implemented for example by DiCE and its variants, always adds bias and can also add variance to meta-gradient estimation. Meanwhile, meta-gradient estimation has been studied less in the important long-horizon setting, where backpropagation through the full inner optimization trajectories is not feasible. We study the bias-variance tradeoff arising from truncated backpropagation and sampling correction, and additionally compare to evolution strategies, a recently popular alternative for long-horizon meta-learning. While prior work implicitly chooses points in this bias-variance space, we disentangle the sources of bias and variance and present an empirical study that relates existing estimators to each other.
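For concreteness, the decomposition below is a minimal sketch of where these two ingredients appear, assuming a single MAML-style inner update; the notation ($\theta$, $\alpha$, $J^{\text{in}}$, $J^{\text{out}}$, $\tau$, $\tau'$) is introduced here for illustration and is not taken from the text above. With an inner objective $J^{\text{in}}(\theta; \tau)$ estimated from pre-adaptation trajectories $\tau \sim \pi_\theta$, and an outer objective $J^{\text{out}}(\theta') = \mathbb{E}_{\tau' \sim \pi_{\theta'}}[R(\tau')]$ evaluated after the update $\theta' = \theta + \alpha \nabla_\theta J^{\text{in}}(\theta; \tau)$, the exact meta-gradient splits into a path term containing the Hessian of the inner RL objective and a score-function term that assigns credit to pre-adaptation behavior:
\[
\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ J^{\text{out}}\big(\theta'(\theta, \tau)\big) \right]
= \underbrace{\mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \big( I + \alpha \nabla^2_\theta J^{\text{in}}(\theta; \tau) \big) \, \nabla_{\theta'} J^{\text{out}}(\theta') \right]}_{\text{post-adaptation path: requires the Hessian}}
\;+\;
\underbrace{\mathbb{E}_{\tau \sim \pi_\theta}\!\left[ J^{\text{out}}(\theta') \, \nabla_\theta \log \pi_\theta(\tau) \right]}_{\text{pre-adaptation credit assignment: the sampling correction}}.
\]
Under this sketch, the first term is where Hessian estimation (e.g., via DiCE-style objectives) enters, and the second term is the one accounted for by the sampling correction.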