Recommender Systems (RSs) are important online applications that affect billions of users every day. The mainstream RS ranking framework is composed of two parts: a Multi-Task Learning (MTL) model that predicts various types of user feedback, e.g., clicks, likes, and shares, and a Multi-Task Fusion (MTF) model that combines the multi-task outputs into a single final ranking score with respect to user satisfaction. The fusion model has received little research attention, although, as the last crucial step of ranking, it has a great impact on the final recommendation. To optimize long-term user satisfaction rather than greedily pursue instant returns, we formulate the MTF task as a Markov Decision Process (MDP) within a recommendation session and propose a Batch Reinforcement Learning (RL) based Multi-Task Fusion framework (BatchRL-MTF) that consists of a Batch RL framework and an online exploration component. The former applies Batch RL to learn an optimal recommendation policy offline from fixed batch data for long-term user satisfaction, while the latter explores potentially high-value actions online to escape the local-optimum dilemma. Based on a comprehensive investigation of user behaviors, we model the user satisfaction reward with subtle heuristics from two aspects: user stickiness and user activeness. Finally, we conduct extensive experiments on a billion-sample real-world dataset to demonstrate the effectiveness of our model. We propose a conservative offline policy estimator (Conservative-OPEstimator) to evaluate our model offline. Furthermore, we conduct online experiments in a real recommendation environment to compare the performance of different models. As one of the few Batch RL approaches successfully applied to the MTF task, our model has also been deployed on a large-scale industrial short-video platform, serving hundreds of millions of users.
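To make the fusion step concrete, below is a minimal sketch of how an MTF action could turn MTL outputs into one ranking score. The weighted-sum parameterization, the task names (pclick, plike, pshare, pfinish), and the example weights are illustrative assumptions, not the paper's exact formulation; in BatchRL-MTF the fusion action would be produced per request by the learned Batch RL policy rather than hard-coded.

```python
# Minimal illustrative sketch (assumed parameterization, not the paper's exact formula).

# Hypothetical MTL head outputs for one candidate item:
# predicted probabilities of click, like, share, and finish.
mtl_scores = {"pclick": 0.31, "plike": 0.07, "pshare": 0.02, "pfinish": 0.55}

def fuse(mtl_scores: dict, action: dict) -> float:
    """Combine multi-task predictions into a single ranking score.

    Here the MTF action is assumed to be a vector of per-task fusion
    weights and the final score is their weighted sum.
    """
    return sum(action[task] * score for task, score in mtl_scores.items())

# In the framework described above, this action would come from the
# Batch RL policy (with online exploration occasionally perturbing it);
# the weights below are placeholders for illustration only.
action = {"pclick": 1.0, "plike": 2.5, "pshare": 4.0, "pfinish": 1.2}

print(f"final ranking score: {fuse(mtl_scores, action):.3f}")
```

Candidates would then be ordered by this fused score, and the session-level reward driving the policy would combine the user stickiness and user activeness signals mentioned above.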