Scaling reinforcement learning (RL) to recommender systems (RS) is promising, since maximizing the expected cumulative reward of an RL agent matches the objective of RS, i.e., improving customers' long-term satisfaction. A key approach to this goal is offline RL, which aims to learn policies from logged data. However, the high-dimensional action space and the non-stationary dynamics of commercial RS intensify distributional shift issues, making it challenging to apply offline RL methods to RS. To alleviate the action distribution shift problem when extracting an RL policy from static trajectories, we propose Value Penalized Q-learning (VPQ), an uncertainty-based offline RL algorithm. It penalizes unstable Q-values in the regression target with uncertainty-aware weights, without requiring an estimate of the behavior policy, making it suitable for RS with a large number of items. We derive the penalty weights from the variances across an ensemble of Q-functions. To alleviate distributional shift issues at test time, we further introduce a critic framework that integrates the proposed method with classic RS models. Extensive experiments on two real-world datasets show that the proposed method can serve as a gain plugin for existing RS models.
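To make the core idea concrete, the sketch below illustrates one plausible form of an uncertainty-penalized regression target built from an ensemble of Q-functions. The exact penalty (standard deviation vs. variance, the weighting schedule, and names such as QEnsemble, penalized_target, and lam) are assumptions for illustration, not the paper's definitive formulation.

```python
# Minimal sketch (PyTorch) of an ensemble-based, uncertainty-penalized Q-target.
# Assumption: the penalty subtracts a scaled ensemble standard deviation from the
# ensemble-mean Q-value before forming the TD target.
import torch
import torch.nn as nn


class QEnsemble(nn.Module):
    """K independent Q-heads mapping a state to Q-values over all items."""

    def __init__(self, state_dim, num_items, k=5, hidden=64):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, num_items))
            for _ in range(k)
        ])

    def forward(self, state):
        # Returns a tensor of shape (k, batch, num_items).
        return torch.stack([head(state) for head in self.heads])


def penalized_target(q_ensemble, next_state, reward, done, gamma=0.99, lam=1.0):
    """TD target that down-weights unstable Q-values: ensemble mean minus
    lam times the ensemble standard deviation (an uncertainty estimate)."""
    with torch.no_grad():
        q_next = q_ensemble(next_state)            # (k, batch, num_items)
        mean_q = q_next.mean(dim=0)                # (batch, num_items)
        std_q = q_next.std(dim=0)                  # disagreement across heads
        penalized = (mean_q - lam * std_q).max(dim=1).values
        return reward + gamma * (1.0 - done) * penalized
```

In this sketch, each Q-head would be regressed toward `penalized_target`, so actions whose values the ensemble disagrees on contribute smaller targets; no behavior-policy estimate is involved, which is the property the abstract highlights for large item catalogs.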