Off-policy learning is a framework for optimizing policies without deploying them, using data collected by another policy. In recommender systems, this is especially challenging due to the imbalance in logged data: some items are recommended, and thus logged, more frequently than others. The problem is exacerbated when recommending lists of items, because the action space is combinatorial. To address this challenge, we study pessimistic off-policy optimization for learning to rank. The key idea is to compute lower confidence bounds on the parameters of click models and then return the list with the highest pessimistic estimate of its value. This approach is computationally efficient, and we analyze it. We study its Bayesian and frequentist variants, and overcome the limitation of an unknown prior by incorporating empirical Bayes. To demonstrate the empirical effectiveness of our approach, we compare it to off-policy optimizers that use inverse propensity scores or neglect uncertainty. Our approach outperforms all baselines, and is both robust and general.
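As a rough illustration of the pessimistic idea (a minimal sketch, not the paper's exact method), the code below assumes a simplified click model in which a list's value decomposes over per-item attraction probabilities. Each item's attraction is estimated pessimistically by a lower quantile of its Beta posterior, and the returned list is the top-k items by that lower confidence bound. All function and variable names here are illustrative.

```python
import numpy as np
from scipy.stats import beta as beta_dist


def pessimistic_ranking(clicks, impressions, k, alpha=1.0, beta=1.0, quantile=0.05):
    """Return the top-k items ranked by a lower confidence bound (LCB)
    on their attraction probability.

    clicks[i], impressions[i]: logged click and impression counts for item i.
    With a Beta(alpha, beta) prior, the posterior is
    Beta(alpha + clicks, beta + non_clicks); the LCB is its `quantile`-th
    quantile, i.e. a pessimistic estimate of the item's value.
    """
    non_clicks = impressions - clicks
    lcb = beta_dist.ppf(quantile, alpha + clicks, beta + non_clicks)
    # The list with the highest pessimistic value under this simplified
    # model is simply the k items with the largest LCBs.
    return np.argsort(-lcb)[:k]


# Rarely-logged items get wide posteriors and thus low LCBs, so the
# pessimistic policy avoids over-recommending them on scant evidence.
clicks = np.array([50, 3, 120])
impressions = np.array([200, 5, 1000])
print(pessimistic_ranking(clicks, impressions, k=2))
```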