Industrial recommender systems deal with extremely large action spaces -- many millions of items to recommend. Moreover, they need to serve billions of users, who are unique at any point in time, resulting in a complex user state space. Luckily, huge quantities of logged implicit feedback (e.g., user clicks, dwell time) are available for learning. Learning from logged feedback is, however, subject to biases caused by only observing feedback on recommendations selected by previous versions of the recommender. In this work, we present a general recipe for addressing such biases in a production top-K recommender system at YouTube, built with a policy-gradient-based algorithm, i.e., REINFORCE. The contributions of the paper are: (1) scaling REINFORCE to a production recommender system with an action space on the order of millions; (2) applying off-policy correction to address data biases when learning from logged feedback collected from multiple behavior policies; (3) proposing a novel top-K off-policy correction to account for our policy recommending multiple items at a time; (4) showcasing the value of exploration. We demonstrate the efficacy of our approaches through a series of simulations and multiple live experiments on YouTube.
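To ground contributions (2) and (3), the following is a minimal sketch of the kind of gradient estimator involved, assuming trajectories are logged under a behavior policy \beta while a parametric policy \pi_\theta is optimized; the specific top-K multiplier \lambda_K shown is a standard illustrative choice, not necessarily the exact form derived in the paper:

% sketch only: importance-weighted REINFORCE gradient under logging policy \beta
\nabla_\theta J(\pi_\theta) \approx \mathbb{E}_{(s_t, a_t) \sim \beta}\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\beta(a_t \mid s_t)} \, R_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right],
\qquad
% assumed illustrative top-K multiplier for a K-item slate
\lambda_K(s_t, a_t) = K \bigl(1 - \pi_\theta(a_t \mid s_t)\bigr)^{K-1},

where the importance ratio \pi_\theta / \beta corrects for the mismatch between the logging and target policies, and the single-item gradient is scaled by \lambda_K to account for the item being recommended as part of a K-item slate rather than alone.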