In real-world recommender systems and search engines, optimizing ranking decisions so that users see a ranked list of relevant items is critical. Off-policy evaluation (OPE) for ranking policies is therefore attracting growing interest, because it enables the performance of new ranking policies to be estimated using only logged data. Although OPE in contextual bandits has been studied extensively, its naive application to the ranking setting suffers from a critical variance issue due to the huge item space. To tackle this problem, previous studies introduce assumptions on user behavior that make the combinatorial item space tractable. However, an unrealistic assumption may, in turn, cause serious bias. Appropriately controlling the bias-variance tradeoff by imposing a reasonable assumption is therefore the key to success in OPE of ranking policies. To achieve a well-balanced bias-variance tradeoff, we propose the Cascade Doubly Robust estimator, which builds on the cascade assumption that a user interacts with items sequentially from the top position in a ranking. We show that the proposed estimator is unbiased in more cases than existing estimators that make stronger assumptions. Furthermore, compared to a previous estimator based on the same cascade assumption, the proposed estimator reduces the variance by leveraging a control variate. Comprehensive experiments on both synthetic and real-world data demonstrate that our estimator leads to more accurate OPE than existing estimators in a variety of settings.
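To make the combination of the cascade assumption and a control variate concrete, here is a minimal Python sketch. The abstract does not state the estimator's formula, so the form below is an assumption: it follows the usual doubly robust pattern for sequential decisions, where the importance weight at position k is the product of step-wise ratios over positions 1 to k and a learned baseline is subtracted as the control variate. The function name `cascade_dr_sketch` and all of its parameters are hypothetical and may differ from the estimator defined in the paper.

```python
from typing import Optional, Sequence


def cascade_dr_sketch(
    rewards: Sequence[Sequence[float]],
    step_weights: Sequence[Sequence[float]],
    q_hat: Sequence[Sequence[float]],
    v_hat: Sequence[Sequence[float]],
    alpha: Optional[Sequence[float]] = None,
) -> float:
    """Illustrative cascade-style doubly robust estimate (assumed form).

    rewards[i][k]      : observed reward at position k of logged ranking i
    step_weights[i][k] : evaluation/behavior probability ratio for the item
                         at position k, given the items ranked above it
    q_hat[i][k]        : baseline estimate of the weighted reward summed from
                         position k to the bottom, at the logged items
    v_hat[i][k]        : the same baseline averaged over the evaluation
                         policy's choice of item at position k
    alpha[k]           : position weight (defaults to 1 for every position)
    """
    n = len(rewards)
    total = 0.0
    for i in range(n):
        slate_len = len(rewards[i])
        a = alpha if alpha is not None else [1.0] * slate_len
        w_before = 1.0  # product of ratios over positions above k (empty product at the top)
        for k in range(slate_len):
            w_upto = w_before * step_weights[i][k]
            # Importance-weighted residual plus the baseline term (control variate).
            total += w_upto * (a[k] * rewards[i][k] - q_hat[i][k])
            total += w_before * v_hat[i][k]
            w_before = w_upto
    return total / n


# Toy logged data: two rankings of length three (made-up numbers).
rewards = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
step_weights = [[2.0, 0.5, 1.0], [0.5, 2.0, 2.0]]
zeros = [[0.0] * 3 for _ in rewards]

# With a zero baseline, only the cascade importance-weighted rewards remain.
ips_like = cascade_dr_sketch(rewards, step_weights, zeros, zeros)
```

Setting the baseline to zero removes the control variate, so the sketch then depends only on cascade importance weights. When the baseline is accurate, the residual `alpha[k] * rewards[i][k] - q_hat[i][k]` is small, which is the intuition behind the variance reduction the abstract describes.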