Learning to capture text-table alignment is essential for tasks like text-to-SQL. A model needs to correctly recognize natural language references to columns and values and to ground them in the given database schema. In this paper, we present a novel weakly supervised Structure-Grounded pretraining framework (STRUG) for text-to-SQL that can effectively learn to capture text-table alignment based on a parallel text-table corpus. We identify a set of novel prediction tasks: column grounding, value grounding, and column-value mapping, and leverage them to pretrain a text-table encoder. Additionally, to evaluate different methods under more realistic text-table alignment settings, we create a new evaluation set, Spider-Realistic, based on the Spider dev set with explicit mentions of column names removed, and adapt eight existing text-to-SQL datasets for cross-database evaluation. STRUG brings significant improvement over BERT-LARGE in all settings. Compared with existing pretraining methods such as GRAPPA, STRUG achieves similar performance on Spider and outperforms all baselines on the more realistic sets. The Spider-Realistic dataset is available at https://doi.org/10.5281/zenodo.5205322.
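To make the three pretraining signals concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation) of how weak labels for column grounding, value grounding, and column-value mapping could be derived from a parallel text-table pair by naive string matching; the function name, inputs, and matching heuristic are all illustrative assumptions.

```python
# Toy illustration (assumption, not the paper's code) of deriving
# weak-supervision labels for the three STRUG pretraining tasks.

def grounding_labels(utterance, columns, cell_values):
    """Return weak labels from naive string matching.

    utterance:   natural-language text paired with the table
    columns:     list of column names
    cell_values: dict mapping column name -> list of its cell values
    """
    text = utterance.lower()
    tokens = text.split()

    # Column grounding: is each column mentioned in the utterance?
    col_labels = [col.lower() in text for col in columns]

    # Value grounding: which utterance tokens appear as a cell value?
    all_values = {v.lower() for vals in cell_values.values() for v in vals}
    val_labels = [tok in all_values for tok in tokens]

    # Column-value mapping: which (token, column) pairs match a cell value?
    mapping = [(tok, col)
               for tok in tokens
               for col, vals in cell_values.items()
               if tok in (v.lower() for v in vals)]
    return col_labels, val_labels, mapping

cols, vals, pairs = grounding_labels(
    "list singers whose country is france",
    ["name", "country"],
    {"name": ["Adele"], "country": ["France", "Spain"]},
)
# cols  -> only "country" is mentioned
# pairs -> the token "france" maps to the "country" column
```

In the actual framework these labels supervise a pretrained text-table encoder rather than being computed by exact matching at inference time; the sketch only illustrates the shape of the supervision.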