Differentiable simulators promise faster computation time for reinforcement learning by replacing zeroth-order gradient estimates of a stochastic objective with an estimate based on first-order gradients. However, it remains unclear which factors determine the performance of the two estimators on complex landscapes that involve long-horizon planning and control on physical systems, despite the crucial relevance of this question for the utility of differentiable simulators. We show that characteristics of certain physical systems, such as stiffness or discontinuities, may compromise the efficacy of the first-order estimator, and analyze this phenomenon through the lens of bias and variance. We additionally propose an $\alpha$-order gradient estimator, with $\alpha \in [0,1]$, which correctly utilizes exact gradients to combine the efficiency of first-order estimates with the robustness of zeroth-order methods. We demonstrate the pitfalls of traditional estimators and the advantages of the $\alpha$-order estimator on several numerical examples.
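The interpolation between the two estimators can be illustrated with a minimal sketch. The code below is a hypothetical toy example, not the paper's implementation: it estimates the gradient of a smoothed objective $F(\theta) = \mathbb{E}_w[f(\theta + w)]$ with Gaussian noise $w$, using a reparameterized first-order estimate (exact gradients of $f$), a zeroth-order score-function estimate, and their convex combination weighted by $\alpha$. The objective `f`, the noise scale `sigma`, and the sample count `n` are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Toy objective with a kink (non-smooth point), standing in for the
    # stiff/discontinuous landscapes discussed in the abstract.
    return np.where(x < 0.0, 0.0, x)

def df(x):
    # Exact almost-everywhere derivative of f, as a differentiable
    # simulator would supply for the first-order estimator.
    return np.where(x < 0.0, 0.0, 1.0)

def alpha_order_gradient(theta, alpha, sigma=0.1, n=1000):
    """Estimate d/dtheta E_w[f(theta + w)], w ~ N(0, sigma^2),
    as an alpha-weighted combination of first- and zeroth-order estimates."""
    w = rng.normal(0.0, sigma, size=n)
    # First-order (reparameterization) estimate: average of exact gradients.
    grad_first = np.mean(df(theta + w))
    # Zeroth-order (score-function) estimate: uses only function values.
    grad_zeroth = np.mean(f(theta + w) * w) / sigma**2
    return alpha * grad_first + (1.0 - alpha) * grad_zeroth
```

Setting $\alpha = 1$ recovers the pure first-order estimate, $\alpha = 0$ the pure zeroth-order one; intermediate values trade the low variance of exact gradients against the robustness of function-value-only estimation.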