In model-based reinforcement learning, the transition matrix and reward vector are often estimated from random samples and are therefore subject to noise. Even if the estimated model is an unbiased estimate of the true underlying model, the value function computed from it is biased. We introduce an operator shifting method for reducing the error introduced by the estimated model. When the error is measured in the residual norm, we prove that the shifting factor is always positive and upper bounded by $1+O\left(1/n\right)$, where $n$ is the number of samples used to learn each row of the transition matrix. We also propose a practical numerical algorithm for implementing the operator shifting.
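To make the setting concrete, the following is a minimal NumPy sketch of tabular policy evaluation under an estimated model, assuming the value function $v = (I - \gamma P)^{-1} r$ and a scalar multiplicative shifting factor $\lambda$; the function names, the noiseless reward, and the Monte Carlo bias check are illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal sketch of the setup in the abstract (illustrative, not the
# paper's algorithm): each row of the transition matrix is estimated
# from n samples, and the plug-in value function is biased even though
# the estimated model itself is unbiased.
import numpy as np

def estimate_model(P_true, r_true, n, rng):
    """Unbiased empirical model: row s of P_hat is the histogram of n
    transitions sampled from the true row P_true[s]. The reward vector
    is taken as noiseless here for simplicity (an assumption)."""
    P_hat = np.stack([rng.multinomial(n, row) / n for row in P_true])
    return P_hat, r_true.copy()

def value(P, r, gamma):
    """Policy-evaluation value function v = (I - gamma * P)^{-1} r."""
    return np.linalg.solve(np.eye(len(r)) - gamma * P, r)

def shifted_value(P_hat, r_hat, gamma, lam):
    """Operator-shifted estimator lam * (I - gamma * P_hat)^{-1} r_hat,
    assuming a scalar multiplicative shift (one possible form)."""
    return lam * value(P_hat, r_hat, gamma)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, gamma, n = 5, 0.9, 20
    P = rng.dirichlet(np.ones(S), size=S)   # true transition matrix
    r = rng.standard_normal(S)              # true reward vector
    v_true = value(P, r, gamma)
    # Although E[P_hat] = P_true, the plug-in value function is biased:
    v_mean = np.mean(
        [value(*estimate_model(P, r, n, rng), gamma) for _ in range(5000)],
        axis=0,
    )
    print("bias of plug-in estimator:", v_mean - v_true)
```

In this sketch, a shifting factor $\lambda$ close to but distinct from $1$, consistent with the abstract's bound $\lambda \in \left(0,\, 1+O(1/n)\right]$, would rescale the plug-in estimate; how $\lambda$ is chosen in practice is what the paper's numerical algorithm addresses.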