Deep Policy Gradient (PG) algorithms employ value networks to drive the learning of parameterized policies and reduce the variance of the gradient estimates. However, value function approximation often gets stuck in local optima and struggles to fit the actual return, limiting the efficacy of variance reduction and leading policies to sub-optimal performance. This paper focuses on improving value approximation and analyzing the effects on Deep PG primitives such as value prediction, variance reduction, and the correlation of gradient estimates with the true gradient. To this end, we introduce a Value Function Search that employs a population of perturbed value networks to search for a better approximation. Our framework does not require additional environment interactions, gradient computations, or ensembles, providing a computationally inexpensive approach to enhance the supervised learning task on which value networks train. Crucially, we show that improving Deep PG primitives results in improved sample efficiency and policies with higher returns on common continuous control benchmark domains.
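To make the idea concrete, the following is a minimal, hypothetical sketch of a population-based value-network search, assuming a PyTorch value network and a batch of (state, return) pairs collected by the current policy. Names such as `ValueNet`, `value_function_search`, `population_size`, and `perturb_std` are illustrative assumptions, not the paper's exact procedure; the point is that candidates are ranked purely on data the value network already trains on, so no extra environment interactions or gradient computations are needed.

```python
# Hypothetical sketch of a population-based search over perturbed value
# networks; not the authors' exact algorithm.
import copy
import torch
import torch.nn as nn


class ValueNet(nn.Module):
    """Small MLP state-value function V(s)."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)


@torch.no_grad()
def value_function_search(value_net: ValueNet,
                          states: torch.Tensor,
                          returns: torch.Tensor,
                          population_size: int = 10,
                          perturb_std: float = 0.01) -> ValueNet:
    """Evaluate Gaussian-perturbed copies of `value_net` and keep whichever
    network (original or perturbed) best fits the observed returns.

    Fitness is the MSE on already-collected rollout data, so the search adds
    no environment interaction and no gradient computation.
    """
    def fitness(net: ValueNet) -> float:
        return torch.mean((net(states) - returns) ** 2).item()

    best_net, best_err = value_net, fitness(value_net)
    for _ in range(population_size):
        candidate = copy.deepcopy(value_net)
        for p in candidate.parameters():
            p.add_(perturb_std * torch.randn_like(p))  # Gaussian perturbation
        err = fitness(candidate)
        if err < best_err:
            best_net, best_err = candidate, err
    return best_net


# Example usage with random placeholder data standing in for rollout
# states and Monte Carlo returns:
if __name__ == "__main__":
    obs_dim = 8
    vnet = ValueNet(obs_dim)
    s = torch.randn(256, obs_dim)
    g = torch.randn(256)
    vnet = value_function_search(vnet, s, g)
```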