The rapid growth of unstructured data offers immense analytical value. Leveraging the ability of large language models (LLMs) to extract the attributes of structured tables from unstructured data, researchers are developing LLM-powered data systems that let users analyze unstructured documents as if they were working with a database. These unstructured data analysis (UDA) systems differ significantly in all aspects, including query interfaces, query optimization strategies, and operator implementations, which makes it unclear which system performs best in which scenario. Unfortunately, no comprehensive benchmark exists that offers high-quality, large-volume, and diverse datasets together with a rich query workload for thoroughly evaluating such systems. To fill this gap, we present UDA-Bench, the first benchmark for unstructured data analysis that meets all of the above requirements. Specifically, we organized a team of 30 graduate students who spent over 10,000 hours in total curating 5 datasets from various domains and constructing a relational database view of these datasets through manual annotation. These relational databases can serve as ground truth for evaluating any UDA system, regardless of differences in programming interface. Moreover, we design diverse queries over the attributes defined in the database schema, covering different types of analytical operators with varying selectivities and complexities. We conduct an in-depth analysis of the key building blocks of existing UDA systems: query interface, query optimization, operator design, and data processing. We run exhaustive experiments on the benchmark to evaluate these systems, and the different techniques they use, with respect to these building blocks.