Autonomous data science, from raw data sources to analyst-grade deep research reports, has been a long-standing challenge and is now becoming feasible with the emergence of powerful large language models (LLMs). Recent workflow-based data agents have shown promising results on specific data tasks, but their reliance on predefined workflows fundamentally limits them in achieving fully autonomous data science. In this paper, we introduce DeepAnalyze-8B, the first agentic LLM designed for autonomous data science, capable of automatically completing the end-to-end pipeline from data sources to analyst-grade deep research reports. To tackle high-complexity data science tasks, we propose a curriculum-based agentic training paradigm that emulates the learning trajectory of human data scientists, enabling LLMs to progressively acquire and integrate multiple capabilities in real-world environments. We also introduce a data-grounded trajectory synthesis framework that constructs high-quality training data. Through agentic training, DeepAnalyze learns to perform a broad spectrum of data tasks, ranging from data question answering and specialized analytical tasks to open-ended data research. Experiments demonstrate that, with only 8B parameters, DeepAnalyze outperforms previous workflow-based agents built on the most advanced proprietary LLMs. The model, code, and training data of DeepAnalyze are open-sourced, paving the way toward autonomous data science.