Despite their strong performance on many tasks, pre-trained language models have been shown to struggle on out-of-distribution compositional generalization. Meanwhile, recent work has shown considerable improvements on many NLP tasks from model scaling. Can scaling up model size also improve compositional generalization in semantic parsing? We evaluate encoder-decoder models up to 11B parameters and decoder-only models up to 540B parameters, and compare model scaling curves for three different methods for applying a pre-trained language model to a new task: fine-tuning all parameters, prompt tuning, and in-context learning. We observe that fine-tuning generally has flat or negative scaling curves on out-of-distribution compositional generalization in semantic parsing evaluations. In-context learning has positive scaling curves, but is generally outperformed by much smaller fine-tuned models. Prompt tuning can outperform fine-tuning, suggesting further potential improvements from scaling as it exhibits a more positive scaling curve. Additionally, we identify several error trends that vary with model scale. For example, larger models are generally better at modeling the syntax of the output space, but are also more prone to certain types of overfitting. Overall, our study highlights limitations of current techniques for effectively leveraging model scale for compositional generalization, while our analysis also suggests promising directions for future work.
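To make the comparison concrete, below is a minimal, hypothetical PyTorch sketch contrasting the three adaptation methods in terms of which parameters are trained. The `ToySeq2Seq` model, dimensions, and hyperparameters are illustrative stand-ins, not the encoder-decoder or decoder-only models evaluated in the paper.

```python
# Minimal sketch (illustrative only): which parameters each adaptation method
# trains. `ToySeq2Seq` is a toy stand-in, not the pre-trained models studied.
import torch
import torch.nn as nn


class ToySeq2Seq(nn.Module):
    """Toy stand-in for a pre-trained encoder-decoder language model."""

    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, src_embeds, tgt_ids):
        _, h = self.encoder(src_embeds)                # encode the utterance
        out, _ = self.decoder(self.embed(tgt_ids), h)  # decode the target
        return self.lm_head(out)                       # logits over output tokens


model = ToySeq2Seq()
src_ids = torch.randint(0, 1000, (2, 12))  # toy "utterance" token ids
tgt_ids = torch.randint(0, 1000, (2, 8))   # toy "logical form" token ids

# (1) Fine-tuning: every parameter of the pre-trained model is updated.
finetune_optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# (2) Prompt tuning: freeze the model and learn only a short sequence of
#     "soft prompt" vectors prepended to the input embeddings.
for p in model.parameters():
    p.requires_grad_(False)
soft_prompt = nn.Parameter(0.02 * torch.randn(1, 20, 64))  # 20 prompt vectors
prompt_optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

inputs = torch.cat([soft_prompt.expand(2, -1, -1), model.embed(src_ids)], dim=1)
logits = model(inputs, tgt_ids)  # gradients reach only `soft_prompt`

# (3) In-context learning: no parameters are updated at all; exemplar
#     (utterance, logical form) pairs are concatenated into the prompt and
#     the frozen model generates the output for a new utterance.
```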