Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports that require synthesizing diverse web information. To address this, we propose WebThinker, a deep research agent that empowers LRMs to autonomously search the web, navigate web pages, and draft reports during the reasoning process. WebThinker integrates a Deep Web Explorer module, enabling LRMs to dynamically search, navigate, and extract information from the web when they encounter knowledge gaps. It also employs an Autonomous Think-Search-and-Draft strategy, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an RL-based training strategy via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at https://github.com/RUC-NLPIR/WebThinker.
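The interleaving described by the Autonomous Think-Search-and-Draft strategy can be pictured as a loop in which the model alternates between deciding, gathering, and writing. The following is a minimal conceptual sketch only, not WebThinker's actual implementation; every function name (`think_step`, `web_search`, `draft_report`) is a hypothetical stand-in for illustration.

```python
# Conceptual sketch of a think-search-and-draft loop.
# All names here are hypothetical illustrations, not WebThinker's real API.

def think_step(question, notes):
    """Decide the next action from the question and the notes gathered so far (stub)."""
    if not notes:
        return ("search", question)   # knowledge gap -> go gather web evidence
    return ("draft", None)            # enough evidence -> write the report

def web_search(query):
    """Stand-in for a web explorer: return snippets for a query (stub)."""
    return [f"snippet about {query}"]

def draft_report(question, notes):
    """Compose a report from the gathered notes (stub)."""
    return f"Report on '{question}': " + "; ".join(notes)

def research(question, max_steps=5):
    """Interleave reasoning, information gathering, and drafting."""
    notes, report = [], None
    for _ in range(max_steps):
        action, arg = think_step(question, notes)
        if action == "search":
            notes.extend(web_search(arg))          # gather as gaps appear...
        else:
            report = draft_report(question, notes)  # ...then draft in context
            break
    return report

print(research("quantum error correction"))
```

The point of the sketch is the control flow: search happens on demand inside the reasoning loop rather than as a fixed retrieve-then-generate pipeline.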