对两个时间尺度的自然行为者 -- -- 竞争算法的简单抽样分析 (Finite Sample Analysis of Two-Time-Scale Natural Actor-Critic Algorithm)

Actor-critic style two-time-scale algorithms are very popular in reinforcement learning, and have seen great empirical success. However, their performance is not completely understood theoretically. In this paper, we characterize the global convergence of an online natural actor-critic algorithm in the tabular setting using a single trajectory. Our analysis applies to very general settings, as we only assume that the underlying Markov chain is ergodic under all policies (the so-called Recurrence assumption). We employ $\epsilon$-greedy sampling in order to ensure enough exploration. For a fixed exploration parameter $\epsilon$, we show that the natural actor critic algorithm is $\mathcal{O}(\frac{1}{\epsilon T^{1/4}}+\epsilon)$ close to the global optimum after $T$ iterations of the algorithm. By carefully diminishing the exploration parameter $\epsilon$ as the iterations proceed, we also show convergence to the global optimum at a rate of $\mathcal{O}(1/T^{1/6})$.

翻译：在强化学习中,Actor-critic 风格的双时间级算法非常受欢迎,并取得了巨大的实证成功。但是,它们的性能在理论上并不完全被人们所理解。在本文中,我们用单一的轨迹来描述表格环境中在线自然行为者-critic 算法的全球趋同。我们的分析适用于非常一般的设置, 我们仅仅假设所有政策( 所谓的Recurence 假设) 所基于的 Markov 链系都是自动的。我们使用 $\ epsilon $- greedy 的采样以确保足够的勘探。对于固定的勘探参数 $\ epselon$ 来说, 我们显示自然演员的批评算法是$\ mathcal{ O} (\\\\ {\\\ \ \\ \\\\\ \ \ \\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

相关内容

马尔可夫链

关注 289

马尔可夫链，因安德烈·马尔可夫（A.A.Markov，1856－1922）得名，是指数学中具有马尔可夫性质的离散事件随机过程。该过程中，在给定当前知识或信息的情况下，过去（即当前以前的历史状态）对于预测将来（即当前以后的未来状态）是无关的。在马尔可夫链的每一步，系统根据概率分布，可以从一个状态变到另一个状态，也可以保持当前状态。状态的改变叫做转移，与不同的状态改变相关的概率叫做转移概率。随机漫步就是马尔可夫链的例子。随机漫步中每一步的状态是在图形中的点，每一步可以移动到任何一个相邻的点，在这里移动到每一个点的概率都是相同的（无论之前漫步路径是如何的）。

INRIA 最新《机器学习理论》课程笔记，176页pdf

专知会员服务

51+阅读 · 2020年12月14日