In model-based reinforcement learning, the transition matrix and reward vector are often estimated from random samples and are therefore subject to noise. Even if the estimated model is an unbiased estimate of the true underlying model, the value function computed from the estimated model is biased. We introduce an operator augmentation method for reducing the error introduced by the estimated model. When the error is measured in the residual norm, we prove that the augmentation factor is always positive and upper bounded by $1 + O(1/n)$, where $n$ is the number of samples used in learning each row of the transition matrix. We also propose a practical numerical algorithm for implementing operator augmentation.
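The bias of the plug-in value estimate is the phenomenon the method targets. The following minimal Python sketch (the 3-state MDP, discount factor, and helper names such as `estimate_P` are illustrative assumptions, not from the paper) checks it by Monte Carlo: even though each estimated row of the transition matrix is an unbiased estimate of the true row, the value function computed from the estimated model differs from the true value function in expectation, because the map from model to value function is nonlinear.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# A small illustrative MDP under a fixed policy: true transition matrix P
# and reward vector r (not taken from the paper).
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
r = np.array([1.0, 0.0, 2.0])
S = P.shape[0]

def value(P_mat):
    """Policy evaluation: v = (I - gamma * P)^{-1} r."""
    return np.linalg.solve(np.eye(S) - gamma * P_mat, r)

v_true = value(P)

def estimate_P(n):
    """Estimate each row of P from n i.i.d. transition samples.
    The empirical row frequencies are an unbiased estimate of the true row."""
    P_hat = np.zeros_like(P)
    for s in range(S):
        counts = rng.multinomial(n, P[s])
        P_hat[s] = counts / n
    return P_hat

# Monte Carlo check: although E[P_hat] = P, the plug-in value estimate
# (I - gamma * P_hat)^{-1} r is biased.
n, trials = 20, 20000
v_mean = np.mean([value(estimate_P(n)) for _ in range(trials)], axis=0)
print("true value:      ", v_true)
print("mean plug-in v:  ", v_mean)
print("estimated bias:  ", v_mean - v_true)
```

Operator augmentation corrects the estimated operator before solving for the value function; the specific correction and the bound on its factor are developed in the body of the paper rather than sketched here.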