Pretrained large generative language models show strong performance on many tasks, but exhibit limited compositional generalization. Scaling such models has been shown to improve their performance on a variety of NLP tasks even when they are only conditioned on a few examples of the task without any fine-tuning (also known as in-context learning). In this work, we examine the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of such models on semantic parsing tasks with in-context learning. In the ID setting, the demonstrations are drawn from the same split (train or test) that the model is evaluated on, while in the OOD setting they are drawn from the other split. We study how the relative generalization gap of in-context learning evolves as models are scaled up. We evaluate four model families, OPT, BLOOM, CodeGen and Codex, on three semantic parsing datasets, CFQ, SCAN and GeoQuery, with varying numbers of exemplars, and observe a trend of decreasing relative generalization gap as models are scaled up.