Off-policy evaluation methods are important in recommendation systems and search engines, where data collected under an existing logging policy is used to estimate the performance of a new proposed policy. A common approach to this problem is importance weighting, where each logged data point is weighted by the density ratio between the probability of its action given the context under the target policy and under the logging policy. In practice, two issues often arise. First, many problems have very large action spaces, so rewards are not observed for most actions and finite samples may exhibit positivity violations. Second, many recommendation systems are not probabilistic, so explicit logging and target policy densities may not be available. To address these issues, we introduce the featurized embedded permutation weighting estimator. The estimator computes the density ratio in an action embedding space, which reduces the possibility of positivity violations. The density ratio is estimated by leveraging recent advances in normalizing flows and in framing density ratio estimation as a classification problem, yielding estimates that are feasible to compute in practice.
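To make the classification-based density ratio idea concrete, below is a minimal sketch (not the authors' implementation) of the standard trick: a probabilistic classifier is trained to distinguish action embeddings drawn under the target policy from those drawn under the logging policy, and with balanced classes its odds recover the density ratio, which then weights the logged rewards in an IPS-style estimate. The embeddings, distributions, and variable names here are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical 4-d action embeddings sampled under each policy.
z_logged = rng.normal(loc=0.0, scale=1.0, size=(5000, 4))  # logging policy
z_target = rng.normal(loc=0.3, scale=1.0, size=(5000, 4))  # target policy
rewards = (z_logged[:, 0] > 0).astype(float)               # rewards seen only for logged actions

# Label target-policy samples 1 and logged samples 0, then fit a classifier.
Z = np.vstack([z_logged, z_target])
y = np.concatenate([np.zeros(len(z_logged)), np.ones(len(z_target))])
clf = LogisticRegression().fit(Z, y)

# With equal class sizes, the classifier's odds estimate the density ratio:
# w(z) = eta(z) / (1 - eta(z)) ~= p_target(z) / p_logged(z).
eta = clf.predict_proba(z_logged)[:, 1]
w = eta / (1.0 - eta)

# Weighted (IPS-style) estimate of the target policy's expected reward.
value_estimate = np.mean(w * rewards)
print(f"estimated target-policy value: {value_estimate:.3f}")
```

Working in the embedding space rather than over raw action identities is what keeps the classifier's inputs dense: two distinct actions with similar embeddings share weight information, so the ratio remains estimable even when most individual actions are never logged.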