Fast Rates for the Regret of Offline Reinforcement Learning

We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP). While existing analyses of common approaches, such as fitted $Q$-iteration (FQI), suggest an $O(1/\sqrt{n})$ convergence rate for regret, empirical behavior exhibits much faster convergence. In this paper, we present a finer regret analysis that exactly characterizes this phenomenon by providing fast rates for the regret convergence. First, we show that given any estimate of the optimal quality function $Q^*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q^*$-estimate's pointwise convergence rate, thus speeding it up. The level of exponentiation depends on the level of noise in the decision-making problem, rather than in the estimation problem. We establish such noise levels for linear and tabular MDPs as examples. Second, we provide new analyses of FQI and Bellman residual minimization to establish the correct pointwise convergence guarantees. As specific cases, our results imply $O(1/n)$ regret rates in linear cases and $\exp(-\Omega(n))$ regret rates in tabular cases.
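The intuition behind the first claim can be illustrated in the tabular setting with a small sketch. It is not the paper's analysis: the two-state MDP, its numbers, and all function names below are hypothetical, chosen only for illustration. The sketch relies on one elementary fact: if every entry of an estimate of $Q^*$ is within less than half of the smallest action gap (the difference between the best and second-best value of $Q^*$ in a state), then the greedy policy of the estimate picks the optimal action in every state, so its regret is zero even though the estimation error is not.

```python
import random

GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP (numbers made up for illustration).
P = [  # P[s][a][t] = probability of moving from state s to state t under action a
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
R = [[1.0, 0.0], [0.0, 2.0]]  # R[s][a] = expected immediate reward


def q_star(iters=2000):
    """Approximate Q* by value iteration on the known model."""
    q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(iters):
        v = [max(row) for row in q]
        q = [[R[s][a] + GAMMA * sum(p * v[t] for t, p in enumerate(P[s][a]))
              for a in range(2)] for s in range(2)]
    return q


def greedy(q):
    """Greedy policy: in each state, the action with the largest q value."""
    return [max(range(2), key=lambda a: q[s][a]) for s in range(2)]


def policy_value(pi, iters=2000):
    """Value of a deterministic policy, by iterating its Bellman equation."""
    v = [0.0, 0.0]
    for _ in range(iters):
        v = [R[s][pi[s]] + GAMMA * sum(p * v[t] for t, p in enumerate(P[s][pi[s]]))
             for s in range(2)]
    return v


def min_gap(q):
    """Smallest gap between the best and second-best action over all states."""
    return min(sorted(row, reverse=True)[0] - sorted(row, reverse=True)[1] for row in q)


def perturb(q, eps, rng):
    """Add independent noise in [-eps, eps] to every entry of q."""
    return [[x + rng.uniform(-eps, eps) for x in row] for row in q]


def regret(q_hat, q_opt):
    """Regret of the greedy policy of q_hat, averaged over a uniform initial state."""
    v_opt = [max(row) for row in q_opt]
    v_pi = policy_value(greedy(q_hat))
    return sum(v_opt[s] - v_pi[s] for s in range(2)) / 2


if __name__ == "__main__":
    q_opt = q_star()
    gap = min_gap(q_opt)
    rng = random.Random(0)
    for eps in (2.0 * gap, 1.0 * gap, 0.4 * gap):
        print(eps, regret(perturb(q_opt, eps, rng), q_opt))
```

Once the error bound drops below half the gap, the regret is zero (up to floating-point error) regardless of how the estimate is perturbed. For larger errors the regret depends on the particular noise draw, so no output is quoted here.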

