Recommender systems aim to answer the following question: given the items that a user has interacted with, which items will this user likely interact with next? Historically, this problem has often been framed as a predictive task via (self-)supervised learning. In recent years, more emphasis has been placed on approaching the recommendation problem from a policy optimization perspective: learning a policy that maximizes some reward function (e.g., user engagement). However, in recommender systems we are typically only able to train a new policy on data collected by a previously deployed policy. The conventional way to address such a policy mismatch is importance sampling correction, which unfortunately comes with its own limitations. In this paper, we suggest an alternative approach based on local policy improvement without off-policy correction. Drawing from a number of related results in causal inference, bandits, and reinforcement learning, we present a suite of methods that compute and optimize a lower bound on the expected reward of the target policy. Crucially, this lower bound is easy to estimate from data and does not involve density ratios (such as those appearing in importance sampling correction). We argue that this local policy improvement paradigm is particularly well suited to recommender systems, since in practice the previously deployed policy is typically of reasonably high quality, and it tends to be retrained frequently and updated continuously. We conclude with practical recipes for applying some of the proposed techniques in a sequential recommendation setting.
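To make the claim about density-ratio-free lower bounds concrete, the following is a standard derivation of this flavor for a one-step (contextual bandit) setting; the notation ($\pi$ for the target policy, $\beta$ for the logging policy, $r$ for a non-negative reward) is illustrative, and this is not necessarily the exact bound developed in the paper. Writing $Z(s) = \mathbb{E}_{a \sim \beta(\cdot\mid s)}[r(s,a)]$ and letting $q(a \mid s) \propto \beta(a \mid s)\, r(s,a)$ denote the reward-tilted logging policy, Jensen's inequality gives
\[
\log \mathbb{E}_{a \sim \pi(\cdot\mid s)}[r(s,a)]
  = \log Z(s) + \log \mathbb{E}_{a \sim q(\cdot\mid s)}\!\left[\frac{\pi(a \mid s)}{\beta(a \mid s)}\right]
  \;\geq\; \log Z(s) + \mathbb{E}_{a \sim q(\cdot\mid s)}\bigl[\log \pi(a \mid s) - \log \beta(a \mid s)\bigr].
\]
Because $\log Z(s)$ and $\mathbb{E}_{q}[\log \beta(a \mid s)]$ do not depend on $\pi$, and $\log$ is monotone, maximizing this lower bound over $\pi$ reduces to maximizing the reward-weighted log-likelihood $\mathbb{E}_{a \sim \beta(\cdot\mid s)}\bigl[r(s,a) \log \pi(a \mid s)\bigr]$, which can be estimated directly from logged $(s, a, r)$ tuples without ever evaluating the density ratio $\pi/\beta$.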