Compositionality, or the ability to combine familiar units like words into novel phrases and sentences, has been the focus of intense interest in artificial intelligence in recent years. To test compositional generalization in semantic parsing, Keysers et al. (2020) introduced Compositional Freebase Queries (CFQ). This dataset maximizes the similarity between the test and train distributions over primitive units, like words, while maximizing the compound divergence: the dissimilarity between test and train distributions over larger structures, like phrases. Dependency parsing, however, lacks a compositional generalization benchmark. In this work, we introduce a gold-standard set of dependency parses for CFQ, and use this to analyze the behavior of a state-of-the-art dependency parser (Qi et al., 2020) on the CFQ dataset. We find that increasing compound divergence degrades dependency parsing performance, although not as dramatically as semantic parsing performance. Additionally, we find that the dependency parser's performance does not degrade uniformly with compound divergence: the parser performs differently on different splits with the same compound divergence. We explore a number of hypotheses for what causes this non-uniform degradation, and identify a number of syntactic structures that drive the dependency parser's lower performance on the most challenging splits.