PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold

Tool-augmented large language models (LLMs) are emerging as deep research agents: systems that decompose complex queries, retrieve external evidence, and synthesize grounded responses. Yet current agents remain limited by shallow retrieval, weak alignment metrics, and brittle tool-use behavior. We introduce PokeeResearch-7B, a 7B-parameter deep research agent built under a unified reinforcement learning framework for robustness, alignment, and scalability. PokeeResearch-7B is trained with an annotation-free Reinforcement Learning from AI Feedback (RLAIF) framework that optimizes the policy using LLM-based reward signals capturing factual accuracy, citation faithfulness, and instruction adherence. A chain-of-thought-driven multi-call reasoning scaffold further enhances robustness through self-verification and adaptive recovery from tool failures. Across 10 popular deep research benchmarks, PokeeResearch-7B achieves state-of-the-art performance among 7B-scale deep research agents. This result highlights that careful reinforcement learning and reasoning design can produce efficient, resilient, and research-grade AI agents. The model and inference code are open-sourced under the MIT license at https://github.com/Pokee-AI/PokeeResearchOSS.
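
To make the idea of a multi-call scaffold with self-verification and recovery from tool failures more concrete, the following is a minimal illustrative sketch. It is not the PokeeResearch implementation: the function `research_loop` and its callables `call_tool`, `propose_answer`, and `verify` are hypothetical names introduced here, and the use of `RuntimeError` to signal a tool failure is an assumption made only for this example.

```python
from typing import Callable, List, Optional


def research_loop(
    question: str,
    propose_answer: Callable[[str, List[str]], str],
    call_tool: Callable[[str], str],
    verify: Callable[[str, str], bool],
    max_turns: int = 3,
) -> Optional[str]:
    """Hypothetical scaffold: gather evidence, draft an answer, verify it, retry."""
    notes: List[str] = []
    for _ in range(max_turns):
        try:
            # One tool call per turn (e.g. a search); may fail.
            notes.append(call_tool(question))
        except RuntimeError as err:
            # Recovery: record the failure and move on to the next turn
            # instead of aborting the whole episode.
            notes.append(f"tool error: {err}")
            continue
        answer = propose_answer(question, notes)
        # Self-verification: only return a draft that passes the check.
        if verify(question, answer):
            return answer
        notes.append(f"rejected draft: {answer}")
    # No verified answer within the turn budget.
    return None
```

In this sketch, a failed tool call or a rejected draft is appended to `notes`, so a later call to `propose_answer` can take the earlier failure into account. In a real agent, all three callables would presumably be backed by the LLM and its tools.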
