Anytime-valid off-policy inference for contextual bandits

Contextual bandits are a modern staple tool for active sequential experimentation in the tech industry. They involve online learning algorithms that adaptively learn, over time, policies that map observed contexts $X_t$ to actions $A_t$ in an attempt to maximize stochastic rewards $R_t$. This adaptivity raises interesting but hard statistical inference questions, especially counterfactual ones: for example, it is often of interest to estimate the properties of a hypothetical policy that differs from the logging policy used to collect the data, a problem known as "off-policy evaluation" (OPE). Using modern martingale techniques, we present a comprehensive framework for OPE inference that relaxes many unnecessary assumptions made in past work, significantly improving on it both theoretically and empirically. Our methods remain valid in very general settings: they can be employed while the original experiment is still running (that is, not necessarily post hoc), when the logging policy may itself be changing (due to learning), and even if the context distributions are drifting over time. More concretely, we derive confidence sequences for various functionals of interest in OPE. These include doubly robust confidence sequences for time-varying off-policy mean reward values, as well as confidence bands for the entire CDF of the off-policy reward distribution. All of our methods (a) are valid at arbitrary stopping times, (b) make only nonparametric assumptions, (c) do not require known bounds on the maximal importance weights, and (d) adapt to the empirical variance of the reward and weight distributions. In summary, our methods enable anytime-valid off-policy inference using adaptively collected contextual bandit data.
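To make the OPE setting concrete, the following is a minimal sketch of the two standard point estimators that off-policy inference builds on: the importance-weighted (IPW) estimate and a doubly robust (DR) variant. It is not the confidence-sequence construction of the paper. The toy environment, the drifting logging policy, the names `simulate`, `target_policy`, `reward_model`, and `ipw_and_dr_estimates`, and all numeric constants are assumptions chosen purely for illustration.

```python
import random

# Hypothetical toy environment: binary context, binary action,
# Bernoulli reward with mean 0.8 if the action matches the context, else 0.2.
MATCH_MEAN, MISMATCH_MEAN = 0.8, 0.2


def simulate(n, seed=0):
    """Collect logs (context, action, reward, logging_prob) under a drifting logging policy."""
    rng = random.Random(seed)
    logs = []
    for t in range(n):
        x = rng.choice([0, 1])
        # The logging policy changes over time; only its recorded
        # action probability is needed for importance weighting.
        p_action1 = 0.3 + 0.4 * t / n
        a = 1 if rng.random() < p_action1 else 0
        h = p_action1 if a == 1 else 1.0 - p_action1
        mean = MATCH_MEAN if a == x else MISMATCH_MEAN
        r = 1.0 if rng.random() < mean else 0.0
        logs.append((x, a, r, h))
    return logs


def target_policy(x):
    """Hypothetical target policy: play the matching action with probability 0.9."""
    return {x: 0.9, 1 - x: 0.1}


def reward_model(x, a):
    """Deliberately crude reward predictor; DR stays unbiased even when this is wrong."""
    return 0.5


def ipw_and_dr_estimates(logs, policy, model):
    """Return (IPW estimate, DR estimate) of the target policy's mean reward."""
    ipw_terms, dr_terms = [], []
    for x, a, r, h in logs:
        pi = policy(x)
        w = pi[a] / h  # importance weight: target prob / logging prob
        ipw_terms.append(w * r)
        # Model-based baseline plus an importance-weighted residual correction.
        baseline = sum(p * model(x, b) for b, p in pi.items())
        dr_terms.append(baseline + w * (r - model(x, a)))
    n = len(logs)
    return sum(ipw_terms) / n, sum(dr_terms) / n


# Value of the target policy in this toy environment, computed analytically.
TRUE_VALUE = 0.9 * MATCH_MEAN + 0.1 * MISMATCH_MEAN

logs = simulate(20000, seed=0)
ipw_hat, dr_hat = ipw_and_dr_estimates(logs, target_policy, reward_model)
print(f"true={TRUE_VALUE:.3f}  ipw={ipw_hat:.3f}  dr={dr_hat:.3f}")
```

Both estimates are averages of terms that are conditionally unbiased for the target policy's value at each step, even though the logging policy drifts. This martingale-like structure is what makes time-uniform inference plausible; the sketch stops at point estimates and gives no interval, let alone one valid at arbitrary stopping times.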
