In reinforcement learning, policy gradient algorithms optimize the policy directly and rely on efficiently sampling the environment. However, while most sampling procedures are based on direct policy sampling, self-performance measures can be used to improve that sampling before each policy update. Following this line of thought, we introduce SAUNA, a method in which non-informative transitions are rejected from the gradient update. The level of information is estimated from the fraction of variance explained by the value function: a measure of the discrepancy between V and the empirical returns. In this work, we use this metric to select the samples that are useful to learn from, and we demonstrate that this selection can significantly improve the performance of policy gradient methods. In this paper: (a) we define SAUNA's metric and introduce the method used to filter transitions; (b) we conduct experiments on a set of benchmark continuous control problems, where SAUNA significantly improves performance; and (c) we investigate how SAUNA reliably selects the samples with the most positive impact on learning, and study its improvement on both performance and sample efficiency.
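For reference, the fraction of variance explained mentioned above is commonly defined as follows; this is a sketch of the standard estimator, and the exact form used by SAUNA may differ in its details:
\[
\mathcal{V}^{ex} \;=\; 1 \;-\; \frac{\operatorname{Var}\!\left[\hat{R}_t - V(s_t)\right]}{\operatorname{Var}\!\left[\hat{R}_t\right]},
\]
where \(\hat{R}_t\) is the empirical return from state \(s_t\) and \(V(s_t)\) is the value estimate. Values close to 1 indicate that the value function accounts for most of the variation in the returns, while values near or below 0 indicate a poor fit.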