EasyTUS: A Comprehensive Framework for Fast and Accurate Table Union Search across Data Lakes

From arXiv. Copyright 2025 IEEE. This is the author's version of the work that has been accepted for publication in the Proceedings of the IEEE International Conference on Big Data (IEEE BigData 2025). The final version of record is available at: tba

Data lakes enable easy maintenance of heterogeneous data in its native form. While this flexibility can accelerate data ingestion, it shifts the complexity of data preparation and query processing to data discovery tasks. One such task is Table Union Search (TUS), which identifies tables that can be unioned with a given input table. In this work, we present EasyTUS, a comprehensive framework that leverages Large Language Models (LLMs) to perform efficient and scalable Table Union Search across data lakes. EasyTUS implements the search pipeline as three modular steps: Table Serialization for consistent formatting and sampling, Table Representation that utilizes LLMs to generate embeddings, and Vector Search that leverages approximate nearest neighbor indexing for semantic matching. To enable reproducible and systematic evaluation, we also introduce TUSBench, a novel standardized benchmarking environment within the EasyTUS framework. TUSBench supports unified comparisons across approaches and data lakes, promoting transparency and progress in the field. Our experiments using TUSBench show that EasyTUS consistently outperforms most of the state-of-the-art approaches, achieving, on average, improvements of up to 34.3% in Mean Average Precision (MAP), up to 79.2x speedup in data preparation, and up to 7.7x faster query processing. Furthermore, EasyTUS maintains strong performance even in metadata-absent settings, highlighting its robustness and adaptability across data lakes.
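
The three-step pipeline described in the abstract (serialize, represent, search) can be illustrated with a minimal Python sketch. This is not the EasyTUS implementation: the function names `serialize_table`, `embed`, and `search`, the `max_rows` parameter, and the toy data lake are all hypothetical. A normalized bag-of-tokens vector stands in for the LLM embedding, and an exact cosine scan stands in for the approximate nearest neighbor index.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Table = Dict[str, List[str]]  # column name -> cell values (assumed toy format)


def serialize_table(table: Table, max_rows: int = 3) -> str:
    """Step 1: flatten a table into one string, sampling the first max_rows cells per column."""
    parts = []
    for name, values in table.items():
        sample = " ".join(str(v) for v in values[:max_rows])
        parts.append(f"{name}: {sample}")
    return " | ".join(parts)


def embed(text: str) -> Dict[str, float]:
    """Step 2 (stand-in): L2-normalized bag-of-tokens vector.

    A real system would call an LLM embedding model here instead.
    """
    tokens = text.lower().replace("|", " ").replace(":", " ").split()
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product of two already-normalized sparse vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def search(query_vec: Dict[str, float],
           index: Dict[str, Dict[str, float]],
           k: int = 2) -> List[Tuple[str, float]]:
    """Step 3 (stand-in): exact top-k cosine scan; an ANN index would replace this at scale."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy data lake (made-up tables, for illustration only).
lake: Dict[str, Table] = {
    "cities_eu": {"city": ["Berlin", "Paris", "Rome"],
                  "country": ["Germany", "France", "Italy"]},
    "cities_asia": {"city": ["Tokyo", "Seoul", "Hanoi"],
                    "country": ["Japan", "Korea", "Vietnam"]},
    "products": {"sku": ["A1", "B2", "C3"],
                 "price": ["9.99", "4.50", "12.00"]},
}

# Data preparation: serialize and embed every table in the lake once.
index = {name: embed(serialize_table(table)) for name, table in lake.items()}

# Query processing: serialize and embed the query table, then search the index.
query: Table = {"city": ["Madrid", "Lisbon"], "country": ["Spain", "Portugal"]}
results = search(embed(serialize_table(query)), index, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

In this toy setup the two city tables rank above the product table only because they share header tokens with the query. The abstract's claim of robustness in metadata-absent settings depends on LLM embeddings capturing the semantics of cell values, which a bag-of-tokens stand-in cannot do.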
