Value estimation in policy gradient methods is a fundamental problem. Generalized Advantage Estimation (GAE) is an exponentially-weighted estimator of the advantage function, analogous to the $\lambda$-return, that substantially reduces the variance of policy gradient estimates at the cost of bias. In practice, a truncated GAE is used because trajectories are incomplete, which introduces a large bias into the estimate. To address this challenge, instead of using the entire truncated GAE, we propose to use only a part of it when computing updates, which significantly reduces the bias caused by the incomplete trajectory. We conduct experiments in MuJoCo and $\mu$RTS to investigate the effect of different partial coefficients and sampling lengths, and show that our partial GAE approach yields better empirical results in both environments.
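The idea can be sketched in code. Below is a minimal NumPy illustration, under the assumption that "taking a part" of the truncated GAE means keeping only the first $\lceil p \cdot (T-t) \rceil$ TD-error terms of each truncated sum, where `partial_coef` ($p$) is a hypothetical parameter name; the later terms, which lean most heavily on the missing trajectory tail, are the ones dropped. This is an interpretive sketch, not the paper's exact algorithm.

```python
import numpy as np

def truncated_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    # Standard truncated GAE over a length-T rollout:
    #   delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    #   A_t = sum_{l=0}^{T-t-1} (gamma * lam)^l * delta_{t+l}
    T = len(rewards)
    values_ext = np.append(values, last_value)  # bootstrap with V(s_T)
    deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):  # backward recursion over the rollout
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

def partial_gae(rewards, values, last_value,
                gamma=0.99, lam=0.95, partial_coef=0.5):
    # Hypothetical "partial GAE": keep only the first ceil(p * (T - t))
    # TD-error terms of each truncated sum. The dropped tail terms are the
    # ones most biased by the trajectory being cut off at T.
    T = len(rewards)
    values_ext = np.append(values, last_value)
    deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]
    adv = np.zeros(T)
    for t in range(T):
        k = max(1, int(np.ceil(partial_coef * (T - t))))
        weights = (gamma * lam) ** np.arange(k)
        adv[t] = np.sum(weights * deltas[t:t + k])
    return adv
```

With `partial_coef=1.0` the partial estimator keeps every term and coincides with the ordinary truncated GAE; smaller values trade a little extra variance structure for less truncation bias.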