Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents

Deep research web agents must not only retrieve information from diverse sources such as web environments, files, and multimodal inputs but, more importantly, rigorously analyze and aggregate that knowledge to produce insightful research. However, existing open-source deep research agents predominantly focus on enhancing the information-seeking capabilities of web agents to locate specific information, while overlooking the essential need for information aggregation, which limits their ability to support in-depth research. We propose an Explore to Evolve paradigm to scalably construct verifiable training data for web agents. Beginning with proactive online exploration, an agent sources grounded information by exploring the real web. Using the collected evidence, the agent then self-evolves an aggregation program by selecting, composing, and refining operations from 12 high-level logical types to synthesize a verifiable QA pair. This evolution from high-level guidance to concrete operations allows us to scalably produce WebAggregatorQA, a dataset of 10K samples across 50K websites and 11 domains. Based on an open-source agent framework, SmolAgents, we collect supervised fine-tuning trajectories to develop a series of foundation models, WebAggregator. WebAggregator-8B matches the performance of GPT-4.1, while the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and closely approaches Claude-3.7-sonnet. Moreover, given the limited availability of benchmarks that evaluate web agents' information aggregation abilities, we construct a human-annotated evaluation split of WebAggregatorQA as a challenging test set. On this benchmark, Claude-3.7-sonnet achieves only 28%, and GPT-4.1 scores 25.8%. Even when agents manage to retrieve all references, they still struggle on WebAggregatorQA, highlighting the need to strengthen the information aggregation capabilities of web agent foundations.

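To make the idea of an aggregation program concrete, the following is a minimal Python sketch of how collected evidence might be passed through a composition of operations to yield a QA pair whose answer can be re-checked by re-running the program. It is an illustration under stated assumptions, not the authors' implementation: the `Evidence` record, the three operations (`filter_years`, `extract_values`, `mean`), the example URLs, and the sample values are all hypothetical, and the operations are not taken from the 12 logical types mentioned in the abstract.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


# Hypothetical evidence record; the field names are illustrative assumptions.
@dataclass(frozen=True)
class Evidence:
    url: str
    entity: str
    year: int
    value: float


def compose(*ops: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Chain operations left to right into one aggregation program."""
    def program(data: Any) -> Any:
        for op in ops:
            data = op(data)
        return data
    return program


# Hypothetical concrete operations (not the paper's logical types).
def filter_years(start: int, end: int) -> Callable[[List[Evidence]], List[Evidence]]:
    """Keep only evidence whose year lies in [start, end]."""
    return lambda items: [e for e in items if start <= e.year <= end]


def extract_values(items: List[Evidence]) -> List[float]:
    """Project each evidence record onto its numeric value."""
    return [e.value for e in items]


def mean(values: List[float]) -> float:
    """Arithmetic mean; refuses empty input so the answer stays well defined."""
    if not values:
        raise ValueError("no evidence left to aggregate")
    return sum(values) / len(values)


# Made-up evidence standing in for facts gathered during web exploration.
evidence = [
    Evidence("https://example.com/a", "City X", 2019, 10.0),
    Evidence("https://example.com/b", "City X", 2020, 14.0),
    Evidence("https://example.com/c", "City X", 2021, 18.0),
    Evidence("https://example.com/d", "City X", 2022, 30.0),
]

select = filter_years(2019, 2021)
program = compose(select, extract_values, mean)

# The answer is produced by executing the program, so it can be verified
# later by running the same program over the same evidence.
qa_pair = {
    "question": "What was the average value for City X from 2019 to 2021?",
    "answer": program(evidence),
    "sources": [e.url for e in select(evidence)],
}
```

Because the answer is derived by running the program rather than written by hand, anyone holding the same evidence can recompute it, which is one plausible reading of what makes such a QA pair verifiable.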