Synthesizing data for semantic parsing has attracted increasing attention recently. However, most methods require handcrafted (high-precision) rules in their generative process, which hinders the exploration of diverse unseen data. In this work, we propose a generative model that features a (non-neural) PCFG modeling the composition of programs (e.g., SQL), and a BART-based translation model that maps a program to an utterance. Owing to the simplicity of the PCFG and pre-trained BART, our generative model can be learned efficiently from the existing data at hand. Moreover, explicitly modeling compositions with a PCFG leads to better exploration of unseen programs, thus generating more diverse data. We evaluate our method in both the in-domain and out-of-domain settings of text-to-SQL parsing, on the standard benchmarks GeoQuery and Spider, respectively. Our empirical results show that the synthesized data generated by our model can substantially help a semantic parser achieve better compositional and domain generalization.
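To make the program-composition step concrete, the following is a minimal sketch of sampling SQL-like programs from a PCFG. The grammar, nonterminals, and rule probabilities here are purely illustrative assumptions, not the grammar induced by the paper:

```python
import random

# Hypothetical toy PCFG over SQL-like programs (illustrative only).
# Each nonterminal maps to a list of (right-hand side, probability) rules.
PCFG = {
    "QUERY": [(["SELECT", "COL", "FROM", "TABLE", "COND"], 1.0)],
    "COND":  [(["WHERE", "COL", "OP", "VAL"], 0.6), ([], 0.4)],
    "COL":   [(["name"], 0.5), (["age"], 0.5)],
    "TABLE": [(["people"], 1.0)],
    "OP":    [([">"], 0.5), (["="], 0.5)],
    "VAL":   [(["30"], 1.0)],
}

def sample(symbol: str, rng: random.Random) -> list:
    """Recursively expand a symbol, choosing rules by their probabilities."""
    if symbol not in PCFG:  # terminal token: emit as-is
        return [symbol]
    rules, weights = zip(*PCFG[symbol])
    rhs = rng.choices(rules, weights=weights, k=1)[0]
    tokens = []
    for sym in rhs:
        tokens.extend(sample(sym, rng))
    return tokens

rng = random.Random(0)
program = " ".join(sample("QUERY", rng))
```

Each sampled program would then be fed to the BART-based translation model to produce a paired natural-language utterance; because the PCFG recombines rules freely, it can compose programs never seen in the training data.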