The proliferation of unstructured data poses a fundamental challenge to traditional database interfaces. While Text-to-SQL has democratized access to structured data, it cannot interpret semantic or multi-modal queries. Concurrently, vector search has emerged as the de facto standard for querying unstructured data, but its integration with SQL, termed VectorSQL, still relies on manual query crafting and lacks standardized evaluation methodologies, leaving a significant gap between its potential and its practical application. To bridge this gap, we introduce and formalize Text2VectorSQL, a novel task that establishes a unified natural language interface for querying both structured and unstructured data. To catalyze research in this new domain, we present a foundational ecosystem comprising: (1) a scalable and robust pipeline for synthesizing high-quality Text-to-VectorSQL training data; (2) VectorSQLBench, the first large-scale, multi-faceted benchmark for this task, covering 12 combinations of three database backends (SQLite, PostgreSQL, ClickHouse) and four data sources (BIRD, Spider, arXiv, Wikipedia); and (3) several novel evaluation metrics designed for more nuanced performance analysis. Extensive experiments confirm strong baseline performance with our trained models and also reveal a recall degradation challenge: combining SQL filters with vector search can lead to more pronounced result omissions than in conventional filtered vector search. By defining the core task, delivering the essential data and evaluation infrastructure, and identifying key research challenges, our work lays the groundwork for the next generation of unified and intelligent data interfaces. Our repository is available at https://github.com/OpenDCAI/Text2VectorSQL.