In recommender systems, reinforcement learning solutions have effectively boosted recommendation performance because of their ability to capture long-term user-system interactions. However, the action space of the recommendation policy is a list of items, which can be extremely large under a dynamic candidate item pool. To overcome this challenge, we propose a hyper-actor and critic learning framework in which the policy decomposes the item list generation process into a hyper-action inference step and an effect-action selection step. The first step maps the given state space into a vectorized hyper-action space, and the second step selects the item list based on the hyper-action. To regulate the discrepancy between the two action spaces, we design an alignment module along with a kernel mapping function for items to ensure inference accuracy, and we include a supervision module to stabilize the learning process. We build simulated environments on public datasets and empirically show that our framework outperforms standard RL baselines in recommendation quality.
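To make the two-step policy concrete, the following is a minimal PyTorch sketch of the decomposition described above: a state is first mapped to a vectorized hyper-action, candidate items are projected into the same space by a kernel mapping, and the effect-action (the item list) is selected by top-k scoring. Names such as `HyperActor` and `item_kernel`, the network sizes, and the dot-product scoring are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch only; architecture details are assumptions.
import torch
import torch.nn as nn


class HyperActor(nn.Module):
    """Maps a user state to a hyper-action vector, then scores candidate
    items via a kernel mapping to select the effect-action (item list)."""

    def __init__(self, state_dim: int, hyper_dim: int, item_dim: int, slate_size: int):
        super().__init__()
        self.slate_size = slate_size
        # Step 1: hyper-action inference (state -> vectorized hyper-action).
        self.state_to_hyper = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, hyper_dim)
        )
        # Kernel mapping that projects raw item features into the hyper-action
        # space, so the two action spaces can be compared and aligned.
        self.item_kernel = nn.Linear(item_dim, hyper_dim)

    def forward(self, state: torch.Tensor, candidate_items: torch.Tensor):
        # state: (B, state_dim); candidate_items: (N, item_dim), a dynamic pool.
        hyper_action = self.state_to_hyper(state)        # (B, hyper_dim)
        item_embs = self.item_kernel(candidate_items)    # (N, hyper_dim)
        scores = hyper_action @ item_embs.T              # (B, N)
        # Step 2: effect-action selection -- take the top-k items as the list.
        effect_action = scores.topk(self.slate_size, dim=-1).indices
        return hyper_action, effect_action, scores


# Usage: a batch of 4 states scored against a pool of 1000 candidate items.
policy = HyperActor(state_dim=32, hyper_dim=16, item_dim=8, slate_size=6)
hyper, slate, scores = policy(torch.randn(4, 32), torch.randn(1000, 8))
print(slate.shape)  # torch.Size([4, 6])
```

In this sketch the critic, the alignment loss between hyper-actions and selected item embeddings, and the supervision module would be trained on top of the returned `hyper_action` and `scores`; only the action decomposition itself is shown here.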