Reparameterization (RP) and likelihood ratio (LR) gradient estimators are used to estimate gradients of expectations throughout machine learning and reinforcement learning; however, they are usually explained as simple mathematical tricks, with little insight into their nature. We take a first-principles approach to show that LR and RP are alternative ways of keeping track of the movement of probability mass, and that the two are connected via the divergence theorem. Moreover, we show that the space of all possible estimators combining LR and RP can be completely parameterized by a flow field $u(x)$ and an importance sampling distribution $q(x)$. We prove that there cannot exist a single-sample estimator of this type outside our characterized space, thus clarifying where we should be searching for better Monte Carlo gradient estimators.
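For concreteness, the sketch below (not taken from the paper) illustrates the two baseline estimators discussed in the abstract on a toy problem: single-sample LR and RP estimates of $\nabla_\theta \mathbb{E}_{x \sim \mathcal{N}(\theta, \sigma^2)}[f(x)]$. The Gaussian location family and the objective `f(x) = x**2` are assumptions chosen purely for illustration.

```python
import numpy as np

def f(x):
    return x ** 2  # toy objective; the true gradient of E[f(x)] w.r.t. theta is 2*theta

def lr_estimate(theta, sigma, rng):
    # Likelihood-ratio (score-function) estimator:
    # f(x) * d/dtheta log p(x; theta), with x sampled from p(.; theta).
    x = rng.normal(theta, sigma)
    score = (x - theta) / sigma ** 2
    return f(x) * score

def rp_estimate(theta, sigma, rng, eps=1e-5):
    # Reparameterization (pathwise) estimator:
    # write x = theta + sigma * z with z ~ N(0, 1), then differentiate f through x.
    z = rng.normal()
    x = theta + sigma * z
    # finite-difference stand-in for df/dx, to keep the sketch dependency-free
    dfdx = (f(x + eps) - f(x - eps)) / (2 * eps)
    return dfdx * 1.0  # dx/dtheta = 1 for a location parameter

rng = np.random.default_rng(0)
theta, sigma, n = 1.5, 0.7, 100_000
lr = np.mean([lr_estimate(theta, sigma, rng) for _ in range(n)])
rp = np.mean([rp_estimate(theta, sigma, rng) for _ in range(n)])
print(f"true grad = {2 * theta:.3f}, LR = {lr:.3f}, RP = {rp:.3f}")
```

Both averages converge to the true gradient $2\theta$, but typically with very different variances, which is what motivates searching the broader space of combined estimators characterized in the paper.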