We study session-based recommendation scenarios where we want to recommend items to users during sequential interactions to improve their long-term utility. Optimizing a long-term metric is challenging because the learning signal (whether the recommendations achieved their desired goals) is delayed and confounded by other user interactions with the system. Immediately measurable proxies such as clicks can lead to suboptimal recommendations due to misalignment with the long-term metric. Many works have applied episodic reinforcement learning (RL) techniques to session-based recommendation, but these methods do not account for policy-induced drift in user intent across sessions. We develop a new batch RL algorithm called Short Horizon Policy Improvement (SHPI) that approximates policy-induced distribution shifts across sessions. By varying the horizon hyper-parameter in SHPI, we recover well-known policy improvement schemes from the RL literature. Empirical results on four recommendation tasks show that SHPI can outperform matrix factorization, offline bandit, and offline RL baselines. We also provide a stable and computationally efficient implementation using weighted regression oracles.
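As a rough illustration of the weighted-regression-oracle idea mentioned above, the sketch below weights logged actions by their truncated (short-horizon) returns and hands them to an off-the-shelf weighted classifier. This is only a hedged approximation of the general approach, not the SHPI algorithm itself; the session data format, the `truncated_returns` and `fit_weighted_policy` helpers, and the choice of logistic regression as the regression oracle are all illustrative assumptions.

```python
# Illustrative sketch: a generic short-horizon, weighted-classification policy
# improvement step. This is NOT the authors' exact SHPI algorithm; all helper
# names and data formats here are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression


def truncated_returns(rewards, horizon, gamma=1.0):
    """Discounted sum of each step's rewards over the next `horizon` steps."""
    T = len(rewards)
    returns = np.zeros(T)
    for t in range(T):
        window = rewards[t:t + horizon]
        returns[t] = sum(gamma ** k * r for k, r in enumerate(window))
    return returns


def fit_weighted_policy(sessions, horizon):
    """Fit a policy with a weighted classification oracle.

    sessions: list of (states, actions, rewards) arrays for each logged session.
    Each logged action is imitated with a weight equal to its truncated
    (short-horizon) return, so actions followed by good short-horizon outcomes
    are reinforced.
    """
    X, y, w = [], [], []
    for states, actions, rewards in sessions:
        X.append(states)
        y.append(actions)
        w.append(truncated_returns(rewards, horizon))
    X, y, w = np.vstack(X), np.concatenate(y), np.concatenate(w)
    w = np.clip(w, 1e-6, None)  # the oracle expects non-negative sample weights
    oracle = LogisticRegression(max_iter=1000)
    oracle.fit(X, y, sample_weight=w)
    return oracle  # oracle.predict(state_features) gives the improved policy's action
```

In this toy version, the `horizon` argument plays the role of the horizon hyper-parameter: a horizon of one reduces to a bandit-style, immediate-reward objective, while a long horizon approaches full-episode return weighting.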