Text-to-SQL semantic parsing is an important NLP task that greatly facilitates interaction between users and databases and has become a key component of many human-computer interaction systems. Much recent progress in text-to-SQL has been driven by large-scale datasets, but most of them are centered on English. In this work, we present MultiSpider, the largest multilingual text-to-SQL dataset, which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese). Using MultiSpider, we further identify the lexical and structural challenges of text-to-SQL (caused by language-specific properties and dialectal expressions) and their intensity across different languages. Experimental results under three typical settings (zero-shot, monolingual, and multilingual) reveal a 6.1% absolute drop in accuracy in non-English languages. We conduct qualitative and quantitative analyses to understand the reasons for the performance drop in each language. Besides the dataset, we also propose a simple schema augmentation framework, SAVe (Schema-Augmentation-with-Verification), which significantly boosts the overall performance by about 1.8% and closes the 29.5% performance gap across languages.