We propose Rec-R1, a general reinforcement learning framework that bridges large language models (LLMs) with recommendation systems through closed-loop optimization. Unlike prompting and supervised fine-tuning (SFT), Rec-R1 directly optimizes LLM generation using feedback from a fixed black-box recommendation model, without relying on synthetic SFT data distilled from proprietary models such as GPT-4o. This avoids the substantial cost and effort required for data distillation. To verify the effectiveness of Rec-R1, we evaluate it on two representative tasks: product search and sequential recommendation. Experimental results demonstrate that Rec-R1 not only consistently outperforms prompting- and SFT-based methods, but also achieves significant gains over strong discriminative baselines, even when paired with simple retrievers such as BM25. Moreover, unlike SFT, which often impairs instruction-following and reasoning, Rec-R1 preserves the general-purpose capabilities of the LLM. These findings position Rec-R1 as a promising foundation for continual, task-specific adaptation without catastrophic forgetting.
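The abstract describes the closed loop only at a high level: the LLM generates text, a fixed black-box recommender scores it, and that score is the only learning signal. The snippet below is a minimal, hypothetical sketch of how such a reward could be computed in the product-search setting, assuming the LLM rewrites a user query and a frozen BM25 retriever's NDCG@k over ground-truth items serves as the scalar reward passed to the RL update (e.g., PPO/GRPO). The toy catalog, `doc_ids`, `relevant_ids`, and `compute_reward` are illustrative names, not artifacts of the paper.

```python
# Minimal sketch of a Rec-R1-style closed-loop reward for product search,
# under the assumptions stated above. Requires `pip install rank-bm25`.
import math
from rank_bm25 import BM25Okapi

# Toy catalog standing in for the product corpus.
corpus = [
    "wireless noise cancelling over-ear headphones",
    "usb-c fast charging cable 2m",
    "bluetooth portable speaker waterproof",
]
doc_ids = ["p1", "p2", "p3"]

# The recommender is fixed: BM25 is built once and never updated.
bm25 = BM25Okapi([doc.split() for doc in corpus])

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k over the retriever's ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, pid in enumerate(ranked_ids[:k])
        if pid in relevant_ids
    )
    ideal = sum(1.0 / math.log2(r + 2) for r in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal > 0 else 0.0

def compute_reward(generated_query, relevant_ids, k=10):
    """Reward for one LLM rollout: rank the corpus with the *fixed* BM25
    retriever and return NDCG@k of the ground-truth items. This scalar is
    the only feedback sent back to the RL optimizer."""
    scores = bm25.get_scores(generated_query.split())
    ranked = [pid for pid, _ in sorted(zip(doc_ids, scores),
                                       key=lambda x: x[1], reverse=True)]
    return ndcg_at_k(ranked, relevant_ids, k)

# One rollout: the policy LLM has rewritten "headset for travel"
# into a richer query; the ground-truth purchase is item p1.
print(compute_reward("noise cancelling wireless headphones", {"p1"}))  # -> 1.0
```

Because the recommender is treated as a black box, the same reward interface would apply unchanged to a dense retriever or a sequential recommender; only the scoring call inside `compute_reward` would differ.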