Most modern Text2SQL systems prompt large language models (LLMs) with the entire schema -- mostly column information -- alongside the user's question. While effective on small databases, this approach fails on real-world schemas that exceed LLM context limits, even for commercial models. The recent Spider 2.0 benchmark exemplifies this challenge: its databases contain hundreds of tables and tens of thousands of columns, on which existing systems often break. Current mitigations either rely on costly multi-step prompting pipelines or filter columns by ranking each one against the user's question independently, ignoring inter-column structure. To scale existing systems, we introduce \toolname, an open-source, LLM-efficient schema filtering framework that compacts Text2SQL prompts by (i) ranking columns with a query-aware LLM encoder enriched with values and metadata, (ii) reranking inter-connected columns via a lightweight graph transformer over functional dependencies, and (iii) selecting a connectivity-preserving sub-schema with a Steiner-tree heuristic. Experiments on real datasets show that \toolname achieves near-perfect recall and higher precision than CodeS, SchemaExP, Qwen rerankers, and embedding retrievers, while maintaining sub-second median latency and scaling to schemas with 23,000+ columns. Our source code is available at https://github.com/thanhdath/grast-sql.
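To illustrate step (iii), the sketch below shows one classic Steiner-tree heuristic for connectivity-preserving sub-schema selection: grow a tree from the top-ranked "terminal" columns by repeatedly attaching the next terminal via a shortest join path over the schema graph. This is a minimal, hypothetical stand-in for \toolname's selection step, assuming columns as nodes and foreign-key/functional-dependency links as edges; the function and graph names are illustrative, not the paper's API.

```python
from collections import deque

def steiner_subschema(edges, terminals):
    """Select a connected sub-schema containing all terminal columns.

    edges: iterable of (column, column) links (e.g. FK / functional
           dependencies); terminals: ranked columns that must be kept.
    Heuristic: attach each terminal to the growing tree along a BFS
    shortest path, keeping every column on that path.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    terminals = list(terminals)
    tree = {terminals[0]}               # seed with the first terminal
    for t in terminals[1:]:
        if t in tree:
            continue
        # BFS outward from t until we touch the current tree.
        parent = {t: None}
        queue = deque([t])
        hit = None
        while queue:
            u = queue.popleft()
            if u in tree:
                hit = u
                break
            for v in adj.get(u, ()):
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
        # Add every column on the connecting path (no-op if unreachable).
        while hit is not None:
            tree.add(hit)
            hit = parent[hit]
    return tree
```

For two terminals joined through two intermediate key columns, the heuristic keeps the bridging columns as well, so the returned sub-schema stays joinable when rendered back into the prompt.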