DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning

Equipping LLMs with the ability to actively search external knowledge is crucial for complex, real-world tasks. Current approaches either rely on prompting to elicit the model's innate agentic capabilities, or hit performance ceilings and training collapse when reinforcement learning (RL) is applied to complex interactive tasks, leaving the model's true agentic potential untapped. To address this, we introduce \textbf{D}ynamic-filter \textbf{S}equence-level \textbf{P}olicy \textbf{O}ptimization (DSPO), an improved RL algorithm designed for robust agent training through sequence-level optimization and dynamic sample filtering. We train our model purely through RL to interleave multi-turn search and reasoning, obviating the need for supervised demonstration data. Across multiple QA benchmarks, our DSPO-trained 7B model improves over comparable prior work by \textbf{34.1\%}, and even outperforms the 14B model from prior work on complex multi-hop QA benchmarks such as HotpotQA by nearly \textbf{9\% relative}, while maintaining exceptional training stability.
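The abstract names two ingredients, sequence-level optimization and dynamic sample filtering, without giving their formulas. The following is a minimal Python sketch of how such ingredients are commonly combined in group-based RL, not the authors' implementation. Everything in it is an assumption for illustration: the function names (`filter_groups`, `group_advantages`, `sequence_ratio`, `clipped_objective`), the filtering rule (drop prompt groups whose rollouts all receive the same reward), the length-normalized sequence-level importance ratio, and the clipping range `eps=0.2`.

```python
import math


def filter_groups(groups):
    """Assumed filtering rule: keep only prompt groups whose rollouts
    disagree on reward, since identical rewards give zero advantage."""
    return [g for g in groups if max(g["rewards"]) != min(g["rewards"])]


def group_advantages(rewards, tiny=1e-8):
    """Normalize rewards within one group to zero mean and unit scale."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + tiny) for r in rewards]


def sequence_ratio(new_logps, old_logps):
    """Assumed sequence-level importance ratio: exponential of the mean
    per-token log-probability difference (one ratio per response)."""
    diff = sum(n - o for n, o in zip(new_logps, old_logps))
    return math.exp(diff / len(new_logps))


def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate applied to a whole sequence."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # Hypothetical rollouts: two prompts, two sampled responses each.
    groups = [
        {"rewards": [1.0, 1.0]},  # no learning signal, filtered out
        {"rewards": [1.0, 0.0]},  # mixed outcomes, kept
    ]
    for group in filter_groups(groups):
        print(group_advantages(group["rewards"]))
```

In this sketch the importance ratio is computed once per response rather than once per token, so a single clipping decision applies to the whole sequence; whether DSPO uses exactly this form is not stated in the abstract.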
