Reinforcement learning for recommendation (RL4Rec) methods are receiving increasing attention because they can quickly adapt to user feedback. A typical RL4Rec framework consists of (1) a state encoder that encodes a state representing the user's historical interactions, and (2) an RL method that takes actions and observes rewards. Prior work compared four state encoders in an environment where user feedback is simulated from real-world logged user data, and found an attention-based state encoder to be the optimal choice, as it reached the highest performance. However, this finding is limited to the actor-critic method, four state encoders, and evaluation simulators that do not debias the logged user data. In response to these shortcomings, we reproduce and extend the existing comparison of attention-based state encoders (1) in the publicly available debiased RL4Rec SOFA simulator, with (2) a different RL method, (3) more state encoders, and (4) a different dataset. Importantly, our experimental results indicate that the existing findings do not generalize to the debiased SOFA simulator generated from a different dataset and to a Deep Q-Network (DQN)-based method when compared against more state encoders.
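The two components named above (a state encoder over the interaction history, and an RL method that selects items and observes rewards) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the attention pooling, the linear Q-head standing in for a DQN, and all sizes and names (`N_ITEMS`, `EMB_DIM`, `attention_state`, `recommend`) are hypothetical assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ITEMS, EMB_DIM = 20, 8            # hypothetical catalogue and embedding sizes
item_emb = rng.normal(size=(N_ITEMS, EMB_DIM))

def attention_state(history):
    """Attention-pooled state over the user's historical interactions."""
    H = item_emb[history]            # (t, d) embeddings of interacted items
    scores = H @ H[-1]               # attend with respect to the most recent item
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()         # softmax attention weights
    return weights @ H               # (d,) pooled state vector

# A linear Q-head stands in for a full DQN here (assumption for brevity).
W_q = rng.normal(size=(EMB_DIM, N_ITEMS))

def recommend(history, epsilon=0.1):
    """Epsilon-greedy action selection over per-item Q-values."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ITEMS))        # explore
    q = attention_state(history) @ W_q           # Q-value for each candidate item
    return int(np.argmax(q))                     # exploit

history = [3, 7, 1]
action = recommend(history)
# An environment (e.g. a simulator such as SOFA) would now return a reward,
# which a DQN-style update would use to improve the Q-head.
```

In the full framework, the state encoder and Q-network are trained jointly from the simulated rewards; swapping `attention_state` for another encoder is exactly the kind of comparison the study performs.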