Unstructured notes within the electronic health record (EHR) contain rich clinical information vital for cancer treatment decision making and research, yet reliably extracting structured oncology data remains challenging due to extensive variability, specialized terminology, and inconsistent document formats. Manual abstraction, although accurate, is prohibitively costly and does not scale. Existing automated approaches typically address narrow scenarios (using synthetic datasets, restricting focus to document-level extraction, or isolating specific clinical variables such as staging, biomarkers, or histology) and do not adequately handle patient-level synthesis across large numbers of clinical documents containing contradictory information. In this study, we propose an agentic framework that systematically decomposes complex oncology data extraction into modular, adaptive tasks. Specifically, we use large language models (LLMs) as reasoning agents, equipped with context-sensitive retrieval and iterative synthesis capabilities, to comprehensively extract structured clinical variables from real-world oncology notes. Evaluated on a large-scale dataset of over 400,000 unstructured clinical notes and scanned PDF reports spanning 2,250 cancer patients, our method achieves an average F1-score of 0.93, with 100 out of 103 oncology-specific clinical variables exceeding 0.85, and critical variables (e.g., biomarkers and medications) surpassing 0.95. Moreover, integrating the agentic system into a data curation workflow resulted in a direct manual approval rate of 0.94, significantly reducing annotation costs. To our knowledge, this constitutes the first exhaustive, end-to-end application of LLM-based agents for structured oncology data extraction at scale.