A long-standing goal of the data management community is to develop general, automated systems that ingest semi-structured documents and output queryable tables without human effort or domain-specific customization. Given the sheer variety of potential documents, state-of-the-art systems make simplifying assumptions and use domain-specific training. In this work, we ask whether we can maintain generality by using large language models (LLMs). LLMs, which are pretrained on broad data, can perform diverse downstream tasks when conditioned only on natural language task descriptions. We propose and evaluate EVAPORATE, a simple prototype system powered by LLMs. We identify two fundamentally different strategies for implementing this system: prompt the LLM to directly extract values from documents, or prompt the LLM to synthesize code that performs the extraction. Our evaluations show a cost-quality tradeoff between these two approaches: code synthesis is cheap, but far less accurate than processing each document directly with the LLM. To improve quality while maintaining low cost, we propose an extended code-synthesis implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction. Our key insight is to generate many candidate functions and ensemble their extractions using weak supervision. EVAPORATE-CODE+ not only outperforms state-of-the-art systems, but does so with a sublinear pass over the documents with the LLM. This equates to a 110x reduction in the number of tokens the LLM must process, averaged across 16 real-world evaluation settings of 10k documents each.
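To make the candidate-function ensemble concrete, here is a minimal sketch of the idea. The extractor functions below are hypothetical hand-written stand-ins for what EVAPORATE-CODE+ would synthesize by prompting an LLM, and the simple majority vote is a simplified proxy for the paper's weak-supervision aggregation; the field name, function names, and example documents are all illustrative assumptions, not part of the system.

```python
import re
from collections import Counter

# Hypothetical candidate extractors for a "year" attribute. In EVAPORATE-CODE+,
# many such functions are synthesized by the LLM; each may work on only a
# subset of documents, which is why their outputs are ensembled.
def extract_a(doc):
    m = re.search(r"Year:\s*(\d{4})", doc)
    return m.group(1) if m else None

def extract_b(doc):
    m = re.search(r"\b(19|20)\d{2}\b", doc)
    return m.group(0) if m else None

def extract_c(doc):
    m = re.search(r"published in (\d{4})", doc)
    return m.group(1) if m else None

def ensemble(doc, fns):
    """Run every candidate function and return the plurality answer.

    A majority vote is a crude proxy for weak supervision, which instead
    learns per-function accuracies and weights the votes accordingly.
    """
    votes = [v for fn in fns if (v := fn(doc)) is not None]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]

# Illustrative documents; the LLM never sees most documents at extraction
# time, since the synthesized functions do the per-document work.
docs = [
    "Title: Foo. Year: 2019. published in 2019.",
    "Bar report, published in 2021.",
]
table = [ensemble(d, [extract_a, extract_b, extract_c]) for d in docs]
# table == ["2019", "2021"]
```

Because the functions, not the LLM, touch each document, the LLM's token cost stays sublinear in corpus size, which is the source of the cost savings the abstract describes.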