Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems, since it enables offline evaluation of new policies using only historic log data. Unfortunately, when the number of actions is large, existing OPE estimators -- most of which are based on inverse propensity score weighting -- degrade severely and can suffer from extreme bias and variance. This foils the use of OPE in many applications from recommender systems to language models. To overcome this issue, we propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space. We characterize the bias, variance, and mean squared error of the proposed estimator and analyze the conditions under which the action embedding provides statistical benefits over conventional estimators. In addition to the theoretical analysis, we find that the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions.
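To make the idea concrete, the following Python sketch contrasts vanilla inverse-propensity-score (IPS) weights with importance weights marginalized over a discrete action embedding. It is a minimal toy, assuming a deterministic, context-free embedding map and fully known policies; all names (`phi`, `pi_0`, `pi_e`, the estimator variables) are illustrative and do not correspond to the paper's actual estimator, which conditions on contexts and allows stochastic embeddings.

```python
# Minimal sketch: IPS vs. marginalized importance weighting over an action
# embedding, in a context-free toy model (illustrative assumptions only).
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_embeddings, n_samples = 1000, 10, 5000

# Deterministic embedding: each action maps to one of a few embedding values.
phi = rng.integers(n_embeddings, size=n_actions)

# Toy logging policy pi_0 and target policy pi_e over actions.
pi_0 = rng.dirichlet(np.ones(n_actions))
pi_e = rng.dirichlet(np.ones(n_actions))

# Log data under pi_0; the reward depends only on the embedding in this toy.
actions = rng.choice(n_actions, size=n_samples, p=pi_0)
base_reward = rng.uniform(size=n_embeddings)
rewards = rng.binomial(1, base_reward[phi[actions]])

# Vanilla IPS: per-action weights can explode when n_actions is large.
ips_weights = pi_e[actions] / pi_0[actions]
v_ips = np.mean(ips_weights * rewards)

# Marginalized weights: ratio of embedding marginals p(e | policy) instead of
# per-action propensities, so the weight range scales with n_embeddings.
p_e_under_pi_0 = np.bincount(phi, weights=pi_0, minlength=n_embeddings)
p_e_under_pi_e = np.bincount(phi, weights=pi_e, minlength=n_embeddings)
marginal_weights = p_e_under_pi_e[phi[actions]] / p_e_under_pi_0[phi[actions]]
v_marginalized = np.mean(marginal_weights * rewards)

true_value = np.dot(pi_e, base_reward[phi])
print(f"true value            : {true_value:.4f}")
print(f"vanilla IPS estimate  : {v_ips:.4f}  (max weight {ips_weights.max():.1f})")
print(f"marginalized estimate : {v_marginalized:.4f}  (max weight {marginal_weights.max():.1f})")
```

In this setup the marginalized weights are bounded by the ratio of embedding marginals rather than per-action propensities, which is the source of the variance reduction the abstract refers to; the price, as analyzed in the paper, is a bias term that vanishes when the embedding captures all reward-relevant structure.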