Modern semantic parsers suffer from two principal limitations. First, training requires expensive collection of utterance-program pairs. Second, semantic parsers fail to generalize at test time to new compositions or structures that were not observed during training. Recent research has shown that automatic generation of synthetic utterance-program pairs can alleviate the first problem, but its potential for the second has thus far been under-explored. In this work, we investigate automatic generation of synthetic utterance-program pairs for improving compositional generalization in semantic parsing. Given a small training set of annotated examples and an "infinite" pool of synthetic examples, we select a subset of synthetic examples that are structurally diverse and use them to improve compositional generalization. We evaluate our approach on a new split of the schema2QA dataset, and show that it leads to dramatic improvements in compositional generalization as well as moderate improvements in the traditional i.i.d. setup. Moreover, structurally diverse sampling achieves these improvements with as few as 5K examples, compared to 1M examples when sampling uniformly at random, a 200x improvement in data efficiency.
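To make the idea of selecting a structurally diverse subset concrete, the following is a minimal sketch under stated assumptions, not the procedure used in this work. It assumes that the structure of a program can be approximated by its set of token bigrams, and it greedily picks the example that covers the most unseen bigrams. The names `substructures` and `select_diverse`, the bigram proxy, and the toy programs are all hypothetical.

```python
from __future__ import annotations


def substructures(program: str, n: int = 2) -> set[tuple[str, ...]]:
    """Hypothetical structure proxy: the set of token n-grams in a program."""
    tokens = program.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def select_diverse(pool: list[tuple[str, str]], k: int) -> list[tuple[str, str]]:
    """Greedily pick up to k (utterance, program) pairs.

    At each step, take the example whose program adds the most
    substructures that no previously selected program covers.
    This is an assumed stand-in for structural diversity.
    """
    remaining = list(pool)
    covered: set[tuple[str, ...]] = set()
    selected: list[tuple[str, str]] = []
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda ex: len(substructures(ex[1]) - covered))
        selected.append(best)
        covered |= substructures(best[1])
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Toy synthetic pool; the programs are made up for illustration.
    pool = [
        ("restaurants named x", "filter ( name == x )"),
        ("restaurants named y", "filter ( name == y )"),
        ("cheapest first", "sort ( price ) asc"),
        ("how many are named x", "count ( filter ( name == x ) )"),
    ]
    for utterance, program in select_diverse(pool, k=2):
        print(utterance, "->", program)
```

Under this proxy, near-duplicate programs contribute few new substructures after the first one is chosen, so a small budget is spent on structurally different examples, whereas uniform sampling would draw them in proportion to their frequency in the pool.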