Efficient methods to evaluate new algorithms are critical for improving interactive bandit and reinforcement learning systems such as recommendation systems. A/B tests are reliable, but they are time-consuming, costly, and entail a risk of failure. In this paper, we develop an alternative method that predicts the performance of an algorithm given historical data possibly generated by a different algorithm. Our estimator has the property that its prediction converges in probability to the true performance of the counterfactual algorithm at a $\sqrt{N}$ rate as the sample size $N$ increases. We also show a correct way to estimate the variance of our prediction, allowing the analyst to quantify the uncertainty in the prediction. These properties hold even when the analyst does not know which among a large number of potentially important state variables are actually important. We validate our method with a simulation experiment on reinforcement learning. Finally, we apply it to improve advertisement design at a major advertising company, and find that our method produces smaller mean squared errors than state-of-the-art methods.
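The abstract does not spell out the estimator itself. As a minimal sketch of the general idea it describes, predicting a counterfactual policy's value from logged data and attaching a root-$N$ standard error so that $\sqrt{N}(\hat{V} - V)$ is stochastically bounded, the snippet below implements the classical inverse-propensity-weighting (IPW) baseline. All names here (`ipw_estimate`, `target_probs`, `logging_probs`) are illustrative assumptions; the paper's actual estimator is more sophisticated, in particular in how it handles a large number of potentially important state variables.

```python
import numpy as np

def ipw_estimate(rewards, target_probs, logging_probs):
    """IPW estimate of a counterfactual policy's value, with a standard error.

    rewards       : observed rewards r_i from the logged data
    target_probs  : pi_e(a_i | x_i), probability the evaluation policy assigns
                    to the action the logging policy actually took
    logging_probs : pi_b(a_i | x_i), probability under the logging policy
    """
    scores = (target_probs / logging_probs) * rewards  # unbiased per-sample scores
    n = len(scores)
    v_hat = scores.mean()                              # point estimate of policy value
    se = scores.std(ddof=1) / np.sqrt(n)               # plug-in root-N standard error
    return v_hat, se

# Toy usage: uniform logging over two arms, evaluating an "always arm 1" policy.
rng = np.random.default_rng(0)
n = 10_000
actions = rng.integers(0, 2, size=n)                   # logged actions, pi_b = 0.5 each
rewards = rng.binomial(1, np.where(actions == 1, 0.6, 0.4))
target_probs = (actions == 1).astype(float)            # pi_e always plays arm 1
v_hat, se = ipw_estimate(rewards, target_probs, np.full(n, 0.5))
print(f"estimated value = {v_hat:.3f} +/- {1.96 * se:.3f}")  # true value is 0.6
```

The per-sample scores are unbiased for the counterfactual value because the importance weights $\pi_e(a \mid x)/\pi_b(a \mid x)$ reweight the logged actions to the evaluation policy's action distribution; the sample standard deviation of those scores then gives a consistent variance estimate, which is the role the abstract assigns to its variance estimator.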