Policy gradient methods are appealing in deep reinforcement learning but suffer from high variance in the gradient estimate. To reduce this variance, a state value function is commonly used as a baseline. However, its effect is limited in stochastic dynamic environments, where unexpected state dynamics and rewards increase the variance. In this paper, we propose replacing the state value function with a novel hindsight value function, which leverages information from the future to reduce the variance of the gradient estimate in stochastic dynamic environments. In particular, to obtain an ideally unbiased gradient estimate, we propose an information-theoretic approach that optimizes the embeddings of the future to be independent of previous actions. In our experiments, we apply the proposed hindsight value function in stochastic dynamic environments with both discrete and continuous actions. Compared with the standard state value function, the proposed hindsight value function consistently reduces the variance, stabilizes training, and improves the final policy.
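To make the variance argument concrete, the following is a minimal sketch on a toy problem, not the method evaluated in the paper. It assumes a hypothetical one-step task with a Bernoulli policy, where the reward is the action plus exogenous Gaussian noise `z` that is independent of the action by construction. The function name `gradient_estimates` and all parameters are illustrative assumptions. The sketch compares the score-function gradient estimate with no baseline, with a state value baseline, and with a hand-written baseline that conditions on the future noise, standing in for a learned hindsight value function.

```python
import numpy as np


def gradient_estimates(theta=0.3, sigma=2.0, n=200_000, seed=0):
    """Per-sample policy gradient estimates on a toy one-step task.

    Assumed setup (illustrative only): a ~ Bernoulli(p) with p = sigmoid(theta),
    reward r = a + z, and z ~ N(0, sigma^2) drawn independently of a.
    """
    rng = np.random.default_rng(seed)
    p = 1.0 / (1.0 + np.exp(-theta))

    a = (rng.random(n) < p).astype(float)   # sampled actions
    z = rng.normal(0.0, sigma, size=n)      # exogenous noise, independent of a
    r = a + z                               # stochastic reward

    score = a - p                           # d/dtheta log pi(a) for a Bernoulli policy

    baselines = {
        "none": np.zeros(n),
        "state_value": np.full(n, p),       # E[r] in the single state
        "hindsight_value": p + z,           # E[r | z]; uses future information only
    }
    estimates = {name: score * (r - b) for name, b in baselines.items()}

    true_grad = p * (1.0 - p)               # d/dtheta E[r] = d/dtheta p
    return estimates, true_grad


if __name__ == "__main__":
    estimates, true_grad = gradient_estimates()
    print(f"true gradient: {true_grad:.4f}")
    for name, g in estimates.items():
        print(f"{name:16s} mean={g.mean():.4f}  var={g.var():.4f}")
```

Because `z` carries no information about the action in this toy setup, subtracting a baseline that depends on `z` leaves the expected gradient unchanged while removing the noise term from the estimate. In a realistic environment that independence does not hold automatically, which is the role of the information-theoretic constraint on the future embeddings described above.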