This paper proves that the episodic learning environment of every finite-horizon decision task has a unique steady state under any behavior policy, and that the marginal distribution of the agent's input indeed converges to this steady-state distribution in essentially all episodic learning processes. This observation supports a mindset that reverses conventional wisdom: while the existence of a unique steady state is often presumed in continual learning but considered less relevant in episodic learning, it turns out that its existence is guaranteed for the latter. Based on this insight, the paper unifies episodic and continual RL around several important concepts that have been treated separately in these two RL formalisms. Practically, the existence of a unique and approachable steady state enables a general way to collect data in episodic RL tasks, which the paper applies to policy gradient algorithms as a demonstration, based on a new steady-state policy gradient theorem. Finally, the paper also proposes and experimentally validates a perturbation method that facilitates rapid steady-state convergence in real-world RL tasks.