It has recently been shown in the literature that sample averages from online learning experiments are biased when used to estimate the mean reward. To correct the bias, off-policy evaluation methods, including importance sampling and doubly robust estimators, typically rely on the propensity score, which is unavailable in this setting due to the unknown reward distribution and the adaptive policy. This paper provides a bootstrap procedure to debias the samples, which requires no knowledge of the reward distribution. Numerical experiments demonstrate effective bias reduction for samples generated by popular multi-armed bandit algorithms such as Explore-Then-Commit (ETC), UCB, Thompson sampling, and $\epsilon$-greedy. We also analyze the procedure and provide theoretical justification under the ETC algorithm, including the asymptotic convergence of the bias decay rate in the real and bootstrap worlds.
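To make the idea concrete, below is a minimal sketch of one way such a bootstrap debiasing step could look, assuming the bandit algorithm is re-run on rewards resampled with replacement from each arm's observed rewards, the bias of the sample mean in this bootstrap world is averaged over replicates, and that estimate is subtracted from the original sample mean. The functions `run_etc` and `bootstrap_debias` and all parameter choices are hypothetical illustrations, not the paper's exact procedure.

```python
import numpy as np

def run_etc(pull_reward, n_arms, horizon, m):
    """Explore-Then-Commit: pull each arm m times, then commit to the
    empirically best arm for the remaining rounds. Returns per-arm rewards."""
    rewards = [[] for _ in range(n_arms)]
    for arm in range(n_arms):          # exploration phase
        for _ in range(m):
            rewards[arm].append(pull_reward(arm))
    best = int(np.argmax([np.mean(r) for r in rewards]))
    for _ in range(horizon - n_arms * m):   # commit phase
        rewards[best].append(pull_reward(best))
    return rewards

def bootstrap_debias(observed, horizon, m, n_boot=500, rng=None):
    """Hypothetical bootstrap bias correction for each arm's sample mean:
    re-run ETC on rewards resampled from the observed data, measure how far
    the bootstrap sample means drift from the bootstrap 'truth' (the original
    sample means), and subtract that estimated bias."""
    rng = np.random.default_rng() if rng is None else rng
    n_arms = len(observed)
    naive_means = np.array([np.mean(r) for r in observed])

    def resample_pull(arm):
        return rng.choice(observed[arm])   # draw with replacement

    boot_bias = np.zeros(n_arms)
    for _ in range(n_boot):
        boot_rewards = run_etc(resample_pull, n_arms, horizon, m)
        boot_means = np.array([np.mean(r) for r in boot_rewards])
        boot_bias += boot_means - naive_means
    boot_bias /= n_boot
    return naive_means - boot_bias         # debiased estimates

# Usage: two Gaussian arms, ETC with m = 20 exploration pulls per arm.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_means = [0.5, 0.6]
    pull = lambda arm: rng.normal(true_means[arm], 1.0)
    data = run_etc(pull, n_arms=2, horizon=500, m=20)
    print("naive:   ", [round(float(np.mean(r)), 3) for r in data])
    print("debiased:", np.round(bootstrap_debias(data, 500, 20, rng=rng), 3))
```

In this sketch the adaptivity of ETC is preserved inside each bootstrap replicate, which is what allows the replicate-level bias to mimic the real-world bias without knowing the reward distribution.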