Modern recommender systems aim to improve user experience. As reinforcement learning (RL) naturally fits this objective -- maximizing a user's reward per session -- it has become an emerging topic in recommender systems. Developing RL-based recommendation methods, however, is not trivial due to the \emph{offline training challenge}. Specifically, the keystone of traditional RL is to train an agent through large amounts of online exploration, making many `errors' in the process. In the recommendation setting, though, we cannot afford the price of making `errors' online. As a result, the agent needs to be trained from offline historical implicit feedback, collected under different recommendation policies; traditional RL algorithms may lead to sub-optimal policies under these offline training settings. Here we propose a new learning paradigm -- namely Prompt-Based Reinforcement Learning (PRL) -- for the offline training of RL-based recommendation agents. While traditional RL algorithms attempt to map state-action input pairs to their expected rewards (e.g., Q-values), PRL directly infers actions (i.e., recommended items) from state-reward inputs. In short, the agents are trained to predict a recommended item given the prior interactions and an observed reward value -- with simple supervised learning. At deployment time, this historical (training) data acts as a knowledge base, while the state-reward pairs are used as a prompt. The agents are thus used to answer the question: \emph{Which item should be recommended given the prior interactions \& the prompted reward value?} We implement PRL with four notable recommendation models and conduct experiments on two real-world e-commerce datasets. Experimental results demonstrate the superior performance of our proposed methods.
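To make the contrast concrete, the two learning targets can be sketched as follows; the notation ($s_t$ for the state summarizing prior interactions, $a_t$ for the recommended item, $r_t$ for the observed reward, $\theta$ and $\phi$ for model parameters) is introduced here purely for illustration and is not taken from the abstract itself:
\begin{align}
  \text{value-based RL:}\quad & Q_\theta(s_t, a_t) \;\approx\; \mathbb{E}\!\left[R_t \mid s_t, a_t\right],\\
  \text{PRL (training):}\quad & \mathcal{L}(\phi) \;=\; -\log \pi_\phi(a_t \mid s_t, r_t),\\
  \text{PRL (deployment):}\quad & \hat{a}_t \;=\; \arg\max_{a}\, \pi_\phi\!\left(a \mid s_t, r_t^{\text{prompt}}\right).
\end{align}
That is, under these assumed symbols, training reduces to supervised prediction (e.g., cross-entropy) of the logged item conditioned on a state-reward pair, and at deployment a desired reward value is supplied as the prompt.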