Experience replay \citep{lin1993reinforcement, mnih2015human} is a widely used technique for improving data efficiency and performance in RL algorithms. In experience replay, past transitions are stored in a memory buffer and re-used during learning. Previous works have proposed various schemes for sampling from the replay buffer, attempting to select the experiences that contribute most to convergence to an optimal policy. Here, we give conditions on the replay sampling scheme that ensure convergence, focusing on the well-known Q-learning algorithm in the tabular setting. After establishing sufficient conditions for convergence, we suggest a somewhat different use of experience replay: replaying memories in a deliberately biased manner as a means to shape the properties of the resulting policy. We initiate a rigorous study of experience replay as a tool to control and modify these properties. In particular, we show that an appropriately biased sampling scheme allows us to obtain a \emph{safe} policy. We believe that using experience replay as a biasing mechanism for steering the resulting policy in desirable ways is an idea with promising potential for many applications.
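For concreteness, here is a minimal sketch of the replay-based update we have in mind, using standard tabular Q-learning notation (step size $\alpha$, discount factor $\gamma$) and writing $\mathcal{B}$ for the replay buffer and $p$ for a, possibly biased, sampling distribution over its contents; these symbols are our notational assumptions here rather than fixed elsewhere in the text. A single replay step draws a stored transition and applies the usual Q-learning update to it:
\begin{equation*}
(s, a, r, s') \sim p(\mathcal{B}),
\qquad
Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right).
\end{equation*}
The choice of $p$ thus determines which experiences drive the updates, and it is precisely this choice that we analyze for convergence and later exploit as a biasing mechanism.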