Many problems involve models that learn probability distributions or otherwise incorporate randomness. In such problems, because computing the true expected gradient may be intractable, a gradient estimator is used to update the model parameters. When the model parameters directly affect a probability distribution, the gradient estimator involves score function terms. This paper studies baselines, a variance reduction technique for score function gradient estimators. Motivated primarily by reinforcement learning, we derive for the first time an expression for the optimal state-dependent baseline, i.e., the baseline that yields the gradient estimator with minimum variance. Although we show that there exist examples where the optimal baseline can be arbitrarily better than a value function baseline, we find that the value function baseline usually performs comparably to the optimal baseline in terms of variance reduction. Moreover, the value function can also be used for bootstrapping estimators of the return, leading to additional variance reduction. Our results give new insight into, and justification for, why value function baselines and the generalized advantage estimator (GAE) work well in practice.
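As a brief illustration of the estimator the abstract refers to, the following displayed equation is a minimal sketch of a score function gradient estimator with a baseline; the notation (distribution $p_\theta$, objective $f$, baseline $b$) is illustrative and not taken from the paper itself:
$$
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}\!\left[f(x)\right]
= \mathbb{E}_{x \sim p_\theta}\!\left[\big(f(x) - b\big)\,\nabla_\theta \log p_\theta(x)\right].
$$
The identity holds for any baseline $b$ that does not depend on $x$, since $\mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(x)\right] = 0$; the baseline therefore leaves the estimator unbiased while potentially reducing its variance.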