Neural text-to-SQL models have achieved remarkable performance in translating natural language questions into SQL queries. However, recent studies reveal that text-to-SQL models are vulnerable to task-specific perturbations, and previously curated robustness test sets usually focus on individual phenomena. In this paper, we propose a comprehensive robustness benchmark based on Spider, a cross-domain text-to-SQL benchmark, to diagnose model robustness. We design 17 perturbations on databases, natural language questions, and SQL queries to measure robustness from different angles. To collect more diverse natural question perturbations, we utilize large pretrained language models (PLMs) to simulate human behavior in creating natural questions. We conduct a diagnostic study of state-of-the-art models on the robustness benchmark. Experimental results reveal that even the most robust model suffers a 14.0% overall performance drop and a 50.7% performance drop on the most challenging perturbation. We also present a breakdown analysis with respect to text-to-SQL model designs and provide insights for improving model robustness.
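To make the PLM-based question perturbation concrete, the sketch below is one minimal way such paraphrases could be generated with a Hugging Face seq2seq model; it is not the paper's actual pipeline, and the model choice, prompt format, and decoding settings are all illustrative assumptions. The idea is to produce candidate rephrasings of a Spider-style question while the underlying SQL intent stays fixed.

```python
from transformers import pipeline

# Illustrative sketch of PLM-based question perturbation.
# "t5-small" is only a placeholder: in practice a model fine-tuned for
# paraphrasing would be used, and generated candidates would still need
# filtering before entering a benchmark.
paraphraser = pipeline("text2text-generation", model="t5-small")

question = "What are the names of employees earning over 50000?"

# The "paraphrase:" prefix is an assumed prompt format; beam search with
# multiple return sequences yields several candidate perturbed questions.
candidates = paraphraser(
    f"paraphrase: {question}",
    num_beams=5,
    num_return_sequences=3,
    max_length=64,
)

for c in candidates:
    print(c["generated_text"])
```

In the benchmark's setting, such machine-generated candidates stand in for the varied phrasings real users produce; a robust text-to-SQL model should map every acceptable rephrasing to the same gold SQL query.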