Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks, sparking growing interest in their application to preference reasoning in recommendation systems. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. However, these methods face significant practical limitations due to (1) the difficulty of obtaining high-quality CoT data in recommendation and (2) the high inference latency caused by generating CoT reasoning. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning. This approach eliminates the need for explicit CoT generation and improves inference efficiency, as a small number of latent tokens can effectively capture the entire reasoning process. Building on this idea, we propose \textit{\underline{R}einforced \underline{Latent} \underline{R}easoning for \underline{R}ecommendation} (LatentR$^3$), a novel end-to-end training framework that leverages reinforcement learning (RL) to optimize latent reasoning without relying on any CoT data. LatentR$^3$ adopts a two-stage training strategy: supervised fine-tuning first initializes the latent reasoning module, and pure RL training then encourages exploration through a rule-based reward design. Our RL implementation builds on a modified GRPO algorithm, which reduces computational overhead during training and introduces continuous reward signals for more efficient learning. Extensive experiments demonstrate that LatentR$^3$ enables effective latent reasoning without any direct supervision of the reasoning process, significantly improving performance when integrated with different LLM-based recommendation methods. Our code is available at https://github.com/xuwenxinedu/R3.
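To make the reward and advantage computation concrete, the sketch below gives a minimal, assumed instantiation of a GRPO-style group-relative advantage combined with a continuous rule-based reward, here taken to be the reciprocal rank of the ground-truth item among the scored candidates. The function names (\texttt{reciprocal\_rank\_reward}, \texttt{group\_relative\_advantages}) and the reciprocal-rank reward choice are illustrative assumptions, not the exact implementation described in the paper.

\begin{verbatim}
import numpy as np

def reciprocal_rank_reward(scores: np.ndarray, target_idx: int) -> float:
    """Continuous rule-based reward: reciprocal rank of the ground-truth item
    in the ranking induced by the model's candidate scores.
    (An assumed instantiation; the paper's exact reward rule may differ.)"""
    # Rank of the target item (1 = highest score).
    rank = 1 + int((scores > scores[target_idx]).sum())
    return 1.0 / rank

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantage: normalize each rollout's reward by the mean and
    standard deviation of its group (rollouts sharing the same prompt/user)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Suppose G = 4 latent-reasoning rollouts for one user, each producing
    # scores over 10 candidate items; the ground-truth item is index 3.
    group_scores = rng.normal(size=(4, 10))
    rewards = np.array([reciprocal_rank_reward(s, target_idx=3)
                        for s in group_scores])
    print("rewards:", rewards)
    print("advantages:", group_relative_advantages(rewards))
\end{verbatim}

Because the reciprocal-rank reward varies smoothly with the ground-truth item's position rather than being a binary hit-or-miss signal, every rollout in a group receives a distinct, informative advantage, which is the intuition behind the continuous reward design mentioned above.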