A central object of study in Reinforcement Learning (RL) is the Markovian policy, in which an agent's actions are chosen from a memoryless probability distribution, conditioned only on its current state. The family of Markovian policies is broad enough to be interesting, yet simple enough to be amenable to analysis. However, RL often involves more complex policies: ensembles of policies, policies over options, policies updated online, etc. Our main contribution is to prove that the occupancy measure of any non-Markovian policy, i.e., the distribution of transition samples it collects, can be equivalently generated by a Markovian policy. This result allows theorems about the Markovian policy class to be directly extended to its non-Markovian counterpart, greatly simplifying proofs, in particular those involving replay buffers and datasets. We provide several examples of such applications in Reinforcement Learning.
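As a sketch of the claimed equivalence, under notation not fixed in the abstract (a discounted MDP with discount factor $\gamma$, and $d^{\pi}$ denoting the normalized discounted state-action occupancy of a possibly history-dependent policy $\pi$), the matching Markovian policy $\mu$ can be taken to be the state-conditional of $d^{\pi}$:
\[
  d^{\pi}(s,a) \;=\; (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,
  \Pr\!\left[s_t = s,\ a_t = a \mid \pi\right],
  \qquad
  \mu(a \mid s) \;=\; \frac{d^{\pi}(s,a)}{\sum_{a'} d^{\pi}(s,a')},
\]
which yields $d^{\mu} = d^{\pi}$, i.e., the Markovian policy $\mu$ reproduces the same distribution of transition samples as $\pi$.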