Human linguistic capacity is often characterized by compositionality and the generalization it enables -- human learners can produce and comprehend novel complex expressions by composing known parts. Several benchmarks gauge compositional generalization by exploiting distributional control between training and test sets, such that certain lexical items occur only in limited contexts during training. While recent work using these benchmarks suggests that pretrained models achieve impressive generalization performance, we argue that exposure to pretraining data can break this distributional control. Using the COGS benchmark of Kim and Linzen (2020), we test two modified evaluation setups that control for this issue: (1) substituting the context-controlled lexical items with novel character sequences, and (2) substituting them with special tokens represented by novel embeddings. Both setups lead to lower generalization performance in T5 (Raffel et al., 2020), suggesting that previously reported results have been overestimated due to uncontrolled lexical exposure during pretraining. The degradation is more severe with novel embeddings and grows with the amount of pretraining data, highlighting an interesting case of inverse scaling.
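To make the two evaluation setups concrete, the following is a minimal sketch (not the authors' released code) of how each substitution could be implemented with the Hugging Face transformers library and a T5 checkpoint. The example sentence, the replacement string "gezvrk", and the token name "<nonce>" are illustrative placeholders, not items from COGS.

```python
# A minimal sketch of the two substitution setups, assuming the
# Hugging Face transformers library and a pretrained T5 checkpoint.
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

sentence = "A hedgehog ate the cake."  # placeholder example

# Setup (1): replace a context-controlled lexical item with a novel
# character sequence unlikely to have appeared in pretraining data.
novel_char_form = sentence.replace("hedgehog", "gezvrk")

# Setup (2): replace the item with a dedicated special token whose
# embedding is freshly initialized, so no pretrained representation
# of the word can leak into evaluation.
tokenizer.add_special_tokens({"additional_special_tokens": ["<nonce>"]})
model.resize_token_embeddings(len(tokenizer))  # appends a new embedding row
with torch.no_grad():
    emb = model.get_input_embeddings().weight
    emb[-1].normal_(mean=0.0, std=model.config.initializer_factor)

novel_embedding_form = sentence.replace("hedgehog", "<nonce>")
```

The key design difference: setup (1) forces the model to fall back on subword composition over unseen character strings, while setup (2) removes even that signal by routing the item through a single randomly initialized embedding.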