One year ago, we open-sourced DocETL, a declarative system for LLM-powered data processing that, as of November 2025, has 3.2K GitHub stars and users across domains including journalism, law, medicine, policy, finance, and urban planning. In DocETL, users build pipelines by composing operators described in natural language, known as semantic operators, with an LLM executing each operator's logic. However, when an operator or the data it operates on is complex, LLMs often produce inaccurate results. To address this challenge, DocETL introduced rewrite directives: abstract rules that guide LLM agents in rewriting pipelines by decomposing operators or data. For example, decomposing a single filter("is this email sent from an executive and discussing fraud?") into the conjunction of two simpler semantic filters may improve accuracy. However, DocETL optimizes only for accuracy, not cost. How do we optimize for both? We present MOAR (Multi-Objective Agentic Rewrites), a new optimizer for DocETL. To target cost optimization, we introduce two new categories of directives and extend all three existing categories with new directives, bringing the total to over 30 -- more than double what DocETL originally had. Moreover, because LLM behavior makes operators interact with each other unpredictably, optimizing operators or sub-pipelines individually can yield suboptimal overall plans. Recognizing this, we design a new global search algorithm that explores rewrites in the context of entire pipelines. Since the space of rewrites is infinite -- pipelines can be rewritten in many ways, and each rewritten pipeline can itself be rewritten -- our algorithm adapts a multi-armed bandit framework to prioritize which pipelines to rewrite. Across six workloads, MOAR achieves 27% higher accuracy than ABACUS, the next-best optimizer, while matching ABACUS's best accuracy at 55% of its cost.
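To make the filter-decomposition example concrete, here is a minimal Python sketch of the rewrite, not DocETL's actual API: `llm_judge`, `semantic_filter`, and the pipeline shapes below are hypothetical stand-ins for illustration only.

```python
# Hypothetical sketch of the filter-decomposition rewrite described above.
# `llm_judge` is a placeholder for a real LLM call; all names here are
# illustrative assumptions, not DocETL's operator interface.
from typing import Callable, Iterable


def llm_judge(prompt: str, doc: str) -> bool:
    """Placeholder for an LLM yes/no judgment over one document."""
    raise NotImplementedError("wire up an LLM client here")


def semantic_filter(prompt: str) -> Callable[[str], bool]:
    """A semantic filter: a natural-language predicate executed by an LLM."""
    return lambda doc: llm_judge(prompt, doc)


# Original plan: one complex question per document.
combined = semantic_filter(
    "Is this email sent from an executive and discussing fraud?"
)

# Rewritten plan: the conjunction of two simpler semantic filters.
# Each question is easier for the LLM, which can improve accuracy, and
# short-circuiting on the first predicate can also reduce cost.
is_executive = semantic_filter("Is this email sent from an executive?")
discusses_fraud = semantic_filter("Does this email discuss fraud?")


def decomposed(doc: str) -> bool:
    return is_executive(doc) and discusses_fraud(doc)


def run_filter(docs: Iterable[str], pred: Callable[[str], bool]) -> list[str]:
    return [d for d in docs if pred(d)]
```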
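The abstract also mentions adapting a multi-armed bandit framework to decide which candidate pipelines to rewrite next. The sketch below illustrates the general bandit idea with a standard UCB1 rule; it is not MOAR's algorithm, and `evaluate`, the reward definition, and the arm-spawning comment are assumptions.

```python
# A minimal UCB1-style sketch of bandit-guided pipeline selection.
# Illustrates the generic multi-armed bandit framing only; the reward
# function and search loop are assumptions, not MOAR's actual method.
import math
import random
from dataclasses import dataclass


@dataclass
class Arm:
    pipeline: str            # identifier for a candidate pipeline
    pulls: int = 0
    total_reward: float = 0.0

    def mean(self) -> float:
        return self.total_reward / self.pulls if self.pulls else 0.0


def ucb_score(arm: Arm, t: int, c: float = 1.4) -> float:
    """UCB1: exploit high-reward arms, explore under-sampled ones."""
    if arm.pulls == 0:
        return float("inf")  # try every pipeline at least once
    return arm.mean() + c * math.sqrt(math.log(t) / arm.pulls)


def evaluate(pipeline: str) -> float:
    """Placeholder: run the pipeline on a data sample and score it,
    e.g., some blend of accuracy and negated cost for a
    multi-objective search. Randomized here for illustration."""
    return random.random()


def bandit_search(pipelines: list[str], budget: int) -> Arm:
    arms = [Arm(p) for p in pipelines]
    for t in range(1, budget + 1):
        arm = max(arms, key=lambda a: ucb_score(a, t))
        reward = evaluate(arm.pipeline)
        arm.pulls += 1
        arm.total_reward += reward
        # In a rewrite setting, a promising arm could also spawn new
        # arms (its rewritten variants), since rewrites compose.
    return max(arms, key=Arm.mean)
```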