We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP). While existing analyses of common approaches, such as fitted $Q$-iteration (FQI), suggest an $O(1/\sqrt{n})$ convergence rate for the regret, empirical behavior exhibits much faster convergence. In this paper, we present a finer regret analysis that exactly characterizes this phenomenon by providing fast rates for the regret convergence. First, we show that given any estimate of the optimal quality function $Q^*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q^*$-estimate's pointwise convergence rate, thus speeding it up. The level of exponentiation depends on the level of noise in the decision-making problem, rather than in the estimation problem. We establish such noise levels for linear and tabular MDPs as examples. Second, we provide new analyses of FQI and Bellman residual minimization to establish the correct pointwise convergence guarantees. As specific cases, our results imply $O(1/n)$ regret rates in linear cases and $\exp(-\Omega(n))$ regret rates in tabular cases.
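Since the analysis centers on the greedy policy induced by an estimate of $Q^*$, the sketch below illustrates one simple way such an estimate can be produced: fitted $Q$-iteration with linear function approximation on offline transitions. This is an illustrative sketch under stated assumptions, not the paper's estimator; the dataset layout, the `featurize` map, and the plain least-squares fitting step are conventions introduced here for concreteness.

```python
# A minimal sketch of fitted Q-iteration (FQI) on offline data, for illustration only.
# Assumptions (not from the paper): the data are (s, a, r, s') transitions collected by a
# fixed behavior policy, actions form a small finite set, and each iteration fits a linear
# least-squares regression of the Bellman targets on a user-supplied feature map.
import numpy as np

def fitted_q_iteration(transitions, featurize, actions, gamma=0.99, iters=100):
    """Return a weight vector w such that Q(s, a) is approximated by featurize(s, a) @ w."""
    dim = featurize(*transitions[0][:2]).shape[0]
    w = np.zeros(dim)
    # Features of the observed (state, action) pairs and their rewards.
    X = np.array([featurize(s, a) for s, a, _, _ in transitions])
    rewards = np.array([r for _, _, r, _ in transitions])
    # Precompute features of every (next state, action) pair for the max in the target.
    next_feats = np.array([[featurize(s_next, a) for a in actions]
                           for _, _, _, s_next in transitions])
    for _ in range(iters):
        # Bellman targets under the current Q estimate: r + gamma * max_a Q(s', a).
        targets = rewards + gamma * (next_feats @ w).max(axis=1)
        # Regress the targets on the (s, a) features to get the next Q estimate.
        w, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return w

def greedy_policy(w, featurize, actions):
    """Greedy policy with respect to the estimated Q function."""
    return lambda s: max(actions, key=lambda a: featurize(s, a) @ w)
```

The greedy policy returned at the end is the object whose regret is analyzed. For orientation, the linear-case rates quoted in the abstract are consistent with a pointwise estimation error of order $n^{-1/2}$ being squared into a regret of order $n^{-1}$.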