Evaluating the performance of an ongoing policy plays a vital role in many areas, such as medicine and economics, by providing crucial guidance on whether to stop an online experiment early and by delivering timely feedback from the environment. Policy evaluation in online learning has thus attracted increasing attention, with the goal of inferring the mean outcome of the optimal policy (i.e., the value) in real time. Yet this problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration-exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration, which quantifies how likely a non-optimal action is to be selected under commonly used bandit algorithms. Using this probability, we conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection for consistency and is asymptotically normal, with a Wald-type confidence interval provided. Extensive simulations and real data applications demonstrate the empirical validity of the proposed DREAM method.
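To make the ideas summarized above concrete, the sketch below is a minimal, illustrative doubly robust value estimate with a Wald-type confidence interval under a two-armed epsilon-greedy bandit. It is not the paper's DREAM estimator: the simulated arm means, the use of the known epsilon-greedy exploration probability as the behavior-policy propensity, and the plug-in sample-mean outcome model are all assumptions made only for illustration.

```python
# Illustrative sketch only (not the DREAM method): doubly robust value
# estimate with a Wald-type CI under an assumed epsilon-greedy bandit.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5])   # assumed arm means; arm 1 is optimal
epsilon, T = 0.1, 5000              # exploration probability and horizon

actions = np.empty(T, dtype=int)
rewards = np.empty(T)
greedy_hist = np.empty(T, dtype=int)
counts, sums = np.zeros(2), np.zeros(2)
for t in range(T):
    # Current greedy arm based on running sample means.
    est = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
    greedy = int(np.argmax(est))
    a = greedy if rng.random() > epsilon else int(rng.integers(2))
    r = rng.normal(true_means[a], 1.0)
    actions[t], rewards[t], greedy_hist[t] = a, r, greedy
    counts[a] += 1
    sums[a] += r

# Plug-in outcome model: final sample means per arm (stands in for the
# online conditional mean estimator discussed in the abstract).
mu_hat = sums / np.maximum(counts, 1)
opt_arm = int(np.argmax(mu_hat))

# Behavior-policy probability of the observed action under epsilon-greedy:
# 1 - epsilon/2 when the greedy arm was taken, epsilon/2 otherwise.
pi_b = np.where(actions == greedy_hist, 1 - epsilon / 2, epsilon / 2)

# Doubly robust pseudo-outcomes for the value of the estimated optimal arm.
indicator = (actions == opt_arm).astype(float)
dr_terms = mu_hat[opt_arm] + indicator / pi_b * (rewards - mu_hat[opt_arm])

value_hat = dr_terms.mean()
se = dr_terms.std(ddof=1) / np.sqrt(T)
print(f"DR value estimate: {value_hat:.3f}, 95% Wald CI: "
      f"({value_hat - 1.96 * se:.3f}, {value_hat + 1.96 * se:.3f})")
```

The double robustness shows up in the pseudo-outcome: if either the outcome model (mu_hat) or the propensity (pi_b) is correct, the average of the pseudo-outcomes remains a consistent estimate of the value; the printed interval is the standard Wald-type interval built from their sample standard deviation.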