Deep research systems, agentic AI that solves complex, multi-step tasks by coordinating reasoning, search across the open web and user files, and tool use, are moving toward hierarchical deployments with a Planner, Coordinator, and Executors. In practice, training the entire stack end-to-end remains impractical, so most work trains a single planner connected to core tools such as search, browsing, and code. While SFT imparts protocol fidelity, it suffers from imitation and exposure biases and underuses environment feedback. Preference-alignment methods such as DPO are schema- and proxy-dependent, off-policy, and weak for long-horizon credit assignment and multi-objective trade-offs. A further limitation of SFT and DPO is their reliance on human-defined decision points and subskills through schema design and labeled comparisons. Reinforcement learning aligns with closed-loop, tool-interaction research: by optimizing trajectory-level policies, it enables exploration, recovery behaviors, and principled credit assignment, and it reduces dependence on such human priors and rater biases. This survey is, to our knowledge, the first dedicated to the RL foundations of deep research systems. It systematizes recent work along three axes: (i) data synthesis and curation; (ii) RL methods for agentic research, covering stability, sample efficiency, long-context handling, reward and credit design, multi-objective optimization, and multimodal integration; and (iii) agentic RL training systems and frameworks. We also cover agent architecture and coordination, as well as evaluation and benchmarks, including recent QA, VQA, long-form synthesis, and domain-grounded tool-interaction tasks. We distill recurring patterns, surface infrastructure bottlenecks, and offer practical guidance for training robust, transparent deep research agents with RL.