This paper proves that the episodic learning environment of every finite-horizon decision task has a unique steady state under any behavior policy, and that the marginal distribution of the agent's input indeed converges to this steady-state distribution in essentially all episodic learning processes. This observation supports a mindset that reverses conventional wisdom: while the existence of a unique steady state is often presumed in continual learning but considered less relevant in episodic learning, it turns out that its existence is guaranteed for the latter. Based on this insight, the paper unifies episodic and continual RL around several important concepts that have been treated separately in these two RL formalisms. Practically, the existence of a unique and approachable steady state enables a general way to collect data in episodic RL tasks, which the paper applies to policy gradient algorithms as a demonstration, based on a new steady-state policy gradient theorem. Finally, the paper also proposes and experimentally validates a perturbation method that facilitates rapid steady-state convergence in real-world RL tasks.