Data augmentation has attracted considerable research attention in the deep learning era for its ability to alleviate data sparseness. The lack of labeled data for unseen evaluation databases is precisely the major challenge in cross-domain text-to-SQL parsing. Previous works either require human intervention to guarantee the quality of generated data or fail to handle complex SQL queries. This paper presents a simple yet effective data augmentation framework. First, given a database, we automatically produce a large number of SQL queries based on an abstract syntax tree grammar. For better distribution matching, we require that at least 80% of the SQL patterns in the training data be covered by the generated queries. Second, we propose a hierarchical SQL-to-question generation model to obtain high-quality natural language questions, which is the major contribution of this work. Finally, we design a simple sampling strategy that can greatly improve training efficiency given large amounts of generated data. Experiments on three cross-domain datasets, i.e., WikiSQL and Spider in English and DuSQL in Chinese, show that our proposed data augmentation framework consistently improves performance over strong baselines, and that the hierarchical generation component is key to the improvement.
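To make the 80% pattern-coverage requirement concrete, the following is a minimal sketch of how such a check could be computed. It is not the authors' implementation: the abstract does not define "SQL pattern", so the sketch assumes a pattern is a query with table/column names and literal values masked out, and that coverage is measured over distinct patterns. The names `sql_pattern`, `pattern_coverage`, and the keyword list are hypothetical.

```python
import re

# Assumption: keywords that are kept verbatim when building a pattern.
# This list is illustrative and far from a complete SQL grammar.
SQL_KEYWORDS = {
    "select", "from", "where", "group", "by", "order", "having", "limit",
    "join", "on", "as", "and", "or", "not", "in", "like", "between",
    "distinct", "asc", "desc", "count", "sum", "avg", "min", "max",
    "union", "intersect", "except",
}

# Quoted strings, numbers, identifiers (optionally dotted), comparison
# operators, and a few punctuation symbols. Anything else is ignored.
TOKEN_RE = re.compile(
    r"""'[^']*'|"[^"]*"|\d+(?:\.\d+)?|[A-Za-z_][\w.]*|[<>!=]+|[(),*]"""
)


def sql_pattern(query: str) -> str:
    """Mask schema names and values so only the query skeleton remains.

    Assumption: this masking scheme stands in for the paper's notion of
    a "SQL pattern", which the abstract does not spell out.
    """
    tokens = []
    for tok in TOKEN_RE.findall(query):
        low = tok.lower()
        if low in SQL_KEYWORDS:
            tokens.append(low)        # keep keywords and aggregate names
        elif tok[0] in "'\"" or tok[0].isdigit():
            tokens.append("VALUE")    # string or numeric literal
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("NAME")     # table or column name
        else:
            tokens.append(tok)        # operators and punctuation
    return " ".join(tokens)


def pattern_coverage(train_queries, generated_queries) -> float:
    """Fraction of distinct training patterns that also occur among the
    generated queries (assumed definition of coverage)."""
    train_patterns = {sql_pattern(q) for q in train_queries}
    if not train_patterns:
        return 0.0
    generated_patterns = {sql_pattern(q) for q in generated_queries}
    return len(train_patterns & generated_patterns) / len(train_patterns)


if __name__ == "__main__":
    train = [
        "SELECT name FROM singer WHERE age > 30",
        "SELECT count(*) FROM concert",
        "SELECT title FROM song ORDER BY year DESC LIMIT 1",
    ]
    generated = [
        "SELECT city FROM airport WHERE elevation > 1000",
        "SELECT count(*) FROM flight",
    ]
    coverage = pattern_coverage(train, generated)
    # Under the 80% requirement, generation would continue until this passes.
    print(f"coverage = {coverage:.2f}, meets 80% threshold: {coverage >= 0.8}")
```

Under these assumptions, a generator would keep producing queries for a database until `pattern_coverage` reaches 0.8. Counting distinct patterns rather than weighting them by frequency is a simplification; a frequency-weighted variant may match the training distribution more closely.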