In recent years, gradient-based Meta-RL (GMRL) methods have achieved remarkable successes in either discovering effective online hyperparameters for a single task (Xu et al., 2018) or learning good initialisations for multi-task transfer learning (Finn et al., 2017). Despite these empirical successes, it is often neglected that computing meta-gradients via vanilla backpropagation is ill-defined. In this paper, we argue that the stochastic meta-gradient estimation adopted by many existing GMRL methods is in fact biased. The bias comes from two sources: 1) the compositional bias that is inherent in the structure of compositional optimisation problems, and 2) the bias of multi-step Hessian estimation caused by direct automatic differentiation. To better understand these meta-gradient biases, we perform the first-of-its-kind study to quantify the magnitude of each. We start by providing a unifying derivation for existing GMRL algorithms, and then theoretically analyse both the bias and the variance of existing gradient estimation methods. Building on this understanding of the underlying principles of bias, we propose two mitigation solutions based on off-policy correction and multi-step Hessian estimation techniques. Comprehensive ablation studies have been conducted, and the results reveal: (1) the existence of these two biases and how they influence meta-gradient estimation when combined with different estimators, sample sizes, numbers of steps, and learning rates; (2) the effectiveness of the mitigation approaches for meta-gradient estimation, and thereby for the final return, on two practical Meta-RL algorithms: LOLA-DiCE and Meta-gradient Reinforcement Learning.
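To give some intuition for the compositional bias mentioned above, the following toy sketch (not the estimator analysed in the paper) estimates a nonlinear function of an expectation, here f(E[X]) with f(m) = m^2, by plugging in a sample mean. The Gaussian distribution, the quadratic f, and the helper name `plug_in_bias` are all assumptions made purely for illustration. Under these assumptions the plug-in estimate is biased by sigma^2 / n, so the bias shrinks as the inner sample size n grows but does not vanish for any finite n.

```python
import random


def plug_in_bias(n, mu=1.0, sigma=1.0, trials=20000, seed=0):
    """Monte Carlo estimate of E[(mean of n draws)^2] - mu^2.

    Hypothetical helper for illustration only: X ~ N(mu, sigma^2),
    and the target quantity is f(E[X]) with f(m) = m^2.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Inner estimate: sample mean of n draws (unbiased for mu).
        mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        # Outer nonlinear function applied to the noisy inner estimate.
        total += mean ** 2
    # Average plug-in estimate minus the true value f(mu) = mu^2.
    return total / trials - mu ** 2


biases = {n: plug_in_bias(n) for n in (1, 10, 100)}
for n, b in biases.items():
    print(f"n={n:3d}  empirical bias={b:+.4f}  theory sigma^2/n={1.0 / n:.4f}")
```

The same mechanism, a nonlinear outer objective applied to a sampled inner quantity, is what makes a naive plug-in estimator biased even when the inner estimate is itself unbiased.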