Casting session-based or sequential recommendation as reinforcement learning (RL) through reward signals is a promising research direction towards recommender systems (RS) that maximize cumulative profits. However, the direct use of RL algorithms in the RS setting is impractical due to challenges like off-policy training, huge action spaces and lack of sufficient reward signals. Recent RL approaches for RS attempt to tackle these challenges by combining RL and (self-)supervised sequential learning, but still suffer from certain limitations. For example, the estimation of Q-values tends to be biased toward positive values due to the lack of negative reward signals. Moreover, the Q-values also depend heavily on the specific timestamp of a sequence. To address the above problems, we propose a negative sampling strategy for training the RL component and combine it with supervised sequential learning. We call this method Supervised Negative Q-learning (SNQN). Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case, which can be further utilized as a normalized weight for learning the supervised sequential part. This leads to another learning framework: Supervised Advantage Actor-Critic (SA2C). We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets. Experimental results show that the proposed approaches achieve significantly better performance than state-of-the-art supervised methods and existing self-supervised RL methods. Code will be open-sourced.
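To make the two ideas above concrete, the following is a minimal PyTorch sketch of a combined objective: a TD loss with sampled negative actions (the SNQN idea) plus an advantage-weighted cross-entropy on the positive action (the SA2C idea). All names here (snqn_sa2c_loss, q_head, etc.) are illustrative rather than the authors' actual API, and the negative-action target is simplified to a zero reward with no bootstrapping; the paper's exact losses and weight normalization may differ.

```python
import torch
import torch.nn.functional as F

def snqn_sa2c_loss(q_head, logits, state, pos_action, neg_actions,
                   reward, q_next_max, gamma=0.5):
    """Sketch of SNQN + SA2C-style training signals (illustrative only).

    q_head(state)  -> Q-values over all items, shape (batch, num_items)
    logits         -> supervised head's scores, shape (batch, num_items)
    pos_action     -> observed item ids, shape (batch,)
    neg_actions    -> sampled negative item ids, shape (batch, num_neg)
    """
    q = q_head(state)                                    # (batch, num_items)
    q_pos = q.gather(1, pos_action.unsqueeze(1)).squeeze(1)
    q_neg = q.gather(1, neg_actions)                     # (batch, num_neg)

    # SNQN: TD error on the observed (positive) action, plus a term that
    # pushes Q-values of sampled negatives toward a zero-reward target,
    # counteracting the positive bias caused by missing negative rewards.
    td_target = reward + gamma * q_next_max
    rl_loss = F.mse_loss(q_pos, td_target.detach()) \
            + F.mse_loss(q_neg, torch.zeros_like(q_neg))

    # SA2C: advantage of the positive action over the sampled average,
    # used as a (detached) weight on the supervised cross-entropy.
    # The paper normalizes this weight; here it is left raw for brevity.
    advantage = (q_pos - torch.cat([q_pos.unsqueeze(1), q_neg], 1).mean(1)).detach()
    ce = F.cross_entropy(logits, pos_action, reduction="none")
    return rl_loss + (advantage * ce).mean()

if __name__ == "__main__":
    # Toy usage with random tensors and a linear Q-head.
    torch.manual_seed(0)
    batch, num_items, dim = 4, 10, 8
    q_net = torch.nn.Linear(dim, num_items)
    state = torch.randn(batch, dim)
    logits = torch.randn(batch, num_items, requires_grad=True)
    pos = torch.randint(num_items, (batch,))
    neg = torch.randint(num_items, (batch, 3))
    loss = snqn_sa2c_loss(q_net, logits, state, pos, neg,
                          reward=torch.ones(batch),
                          q_next_max=torch.zeros(batch))
    loss.backward()
    print(loss.item())
```

In this sketch the supervised sequential model and the Q-head would share the same state encoder (e.g., one of the four base recommenders), with the advantage acting purely as a per-example weight on the supervised loss rather than as a gradient path of its own.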