Contextual bandit algorithms are ubiquitous tools for active sequential experimentation in healthcare and the tech industry. They involve online learning algorithms that adaptively learn policies over time to map observed contexts $X_t$ to actions $A_t$ in an attempt to maximize stochastic rewards $R_t$. This adaptivity raises interesting but hard statistical inference questions, especially counterfactual ones: for example, it is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data -- a problem known as ``off-policy evaluation'' (OPE). Using modern martingale techniques, we present a comprehensive framework for OPE inference that relaxes many unnecessary assumptions made in past work, significantly improving on prior methods both theoretically and empirically. Importantly, our methods can be employed while the original experiment is still running (that is, not necessarily post-hoc), when the logging policy may itself be changing (due to learning), and even if the contexts form a highly dependent time series (for example, if they are drifting over time). More concretely, we derive confidence sequences for various functionals of interest in OPE. These include doubly robust ones for time-varying off-policy mean reward values, but also confidence bands for the entire CDF of the off-policy reward distribution. All of our methods (a) are valid at arbitrary stopping times, (b) make only nonparametric assumptions, (c) do not require known bounds on the maximal importance weights, and (d) adapt to the empirical variance of our estimators. In summary, our methods enable anytime-valid off-policy inference using adaptively collected contextual bandit data.
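To make the doubly robust quantities mentioned above concrete, the following is a minimal sketch of a standard per-round doubly robust pseudo-outcome for the value of a target policy $\pi$; the notation ($\pi_t$ for the logging policy at round $t$ and $\widehat{\mu}_t$ for a reward regression fit on data collected before round $t$) is assumed here for exposition and is not necessarily the exact construction used in the paper:
\[
  \widehat{\phi}_t \;=\; \sum_{a} \pi(a \mid X_t)\, \widehat{\mu}_t(X_t, a)
  \;+\; \frac{\pi(A_t \mid X_t)}{\pi_t(A_t \mid X_t)}
  \bigl( R_t - \widehat{\mu}_t(X_t, A_t) \bigr).
\]
When the logging probabilities $\pi_t(A_t \mid X_t)$ are known, $\widehat{\phi}_t$ is conditionally unbiased for the value of $\pi$ at round $t$ given the past, which is the type of martingale structure that anytime-valid, variance-adaptive confidence sequences can exploit.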