A long-standing goal of the data management community is to develop general, automated systems that ingest semi-structured documents and output queryable tables without human effort or domain-specific customization. Given the sheer variety of potential documents, state-of-the-art systems make simplifying assumptions and rely on domain-specific training. In this work, we ask whether we can maintain generality by using large language models (LLMs). LLMs, which are pretrained on broad data, can perform diverse downstream tasks when conditioned only on natural language task descriptions. We propose and evaluate EVAPORATE, a simple, prototype system powered by LLMs. We identify two fundamentally different strategies for implementing this system: prompt the LLM to directly extract values from documents, or prompt the LLM to synthesize code that performs the extraction. Our evaluations show a cost-quality tradeoff between these two approaches. Code synthesis is cheap, but far less accurate than directly processing each document with the LLM. To improve quality while maintaining low cost, we propose an extended code synthesis implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction. Our key insight is to generate many candidate functions and ensemble their extractions using weak supervision. EVAPORATE-CODE+ not only outperforms state-of-the-art systems, but does so using a sublinear pass over the documents with the LLM. This equates to a 110x reduction in the number of tokens the LLM needs to process, averaged across 16 real-world evaluation settings of 10k documents each.
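The key insight above, synthesizing many candidate extraction functions and ensembling their outputs, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the extractor functions, regexes, field, and documents are invented stand-ins for LLM-synthesized code, and a plain plurality vote replaces the learned weak-supervision aggregation that EVAPORATE-CODE+ uses to weight functions by estimated quality.

```python
import re
from collections import Counter

# Hypothetical candidate extractors for a "price" field, standing in for
# LLM-synthesized functions. Each takes a document string and returns an
# extracted value, or None if it cannot find one (an abstention).
def extract_v1(doc):
    m = re.search(r"Price:\s*\$?([\d.]+)", doc)
    return m.group(1) if m else None

def extract_v2(doc):
    m = re.search(r"\$([\d.]+)", doc)
    return m.group(1) if m else None

def extract_v3(doc):
    # A deliberately noisy candidate: grabs the first number it sees.
    m = re.search(r"[\d.]+", doc)
    return m.group(0) if m else None

def ensemble_extract(doc, candidates):
    """Run every candidate on the document and return the plurality value.

    Uniform voting for simplicity; weak supervision would instead estimate
    each function's accuracy from agreements/disagreements across documents
    and take a weighted vote, so one noisy function cannot dominate.
    """
    votes = [f(doc) for f in candidates]
    votes = [v for v in votes if v is not None]  # abstentions do not vote
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]

docs = [
    "Item 42. Price: $19.99 in stock",
    "Shipping 3 days. Price: $5.00",
]
candidates = [extract_v1, extract_v2, extract_v3]
table = [ensemble_extract(d, candidates) for d in docs]
```

Note that only the cheap synthesized functions touch the corpus here; the LLM is invoked once per field to produce candidates, which is what yields the sublinear pass over the documents.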