In recent years, there has been great interest, as well as significant challenges, in applying reinforcement learning (RL) to recommender systems (RS). In this paper, we summarize three key practical challenges of large-scale RL-based recommender systems: massive state and action spaces, a high-variance environment, and the underspecified reward setting in recommendation. These problems remain largely unexplored in the existing literature and make the application of RL challenging. We develop a model-based reinforcement learning framework called GoalRec. Inspired by the ideas of world models (model-based RL), value function estimation (model-free RL), and goal-based RL, we propose a novel disentangled universal value function designed for item recommendation. It can generalize to the various goals that a recommender may have, and accordingly disentangles the stochastic environmental dynamics from the high-variance reward signals. As part of the value function, a high-capacity, reward-independent world model, free from the sparse and high-variance reward signals, is trained to simulate complex environmental dynamics under a given goal. Based on the predicted environmental dynamics, the disentangled universal value function is defined over the user's future trajectory rather than a monolithic state and a scalar reward. We demonstrate the superiority of GoalRec over previous approaches with respect to these three practical challenges in a series of simulations and a real application.
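As an illustrative sketch of the disentanglement described above (the notation $M$, $U$, and $\hat{\tau}$ is ours, introduced for exposition and not taken from the paper), the universal value function can be read as a reward-independent world model composed with a goal-dependent readout:
\[
V(s, g) \;\approx\; U\big(\hat{\tau},\, g\big), \qquad \hat{\tau} = M(s, g),
\]
where $s$ is the user state, $g$ the recommender's goal, $\hat{\tau}$ the predicted future user trajectory produced by the world model $M$ (trained without reward signals), and $U$ a goal-conditioned evaluation of that trajectory. Under this reading, the sparse, high-variance reward only influences $U$, while $M$ captures the environmental dynamics.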