The ability to extrapolate from short problem instances to longer ones is an important form of out-of-distribution generalization in reasoning tasks, and is crucial when learning from datasets where longer problem instances are rare. These include theorem proving, solving quantitative mathematics problems, and reading/summarizing novels. In this paper, we run careful empirical studies exploring the length generalization capabilities of transformer-based language models. We first establish that naively finetuning transformers on length generalization tasks shows significant generalization deficiencies independent of model scale. We then show that combining pretrained large language models' in-context learning abilities with scratchpad prompting (asking the model to output solution steps before producing an answer) results in a dramatic improvement in length generalization. We run careful failure analyses on each of the learning modalities and identify common sources of mistakes that highlight opportunities for equipping language models with the ability to generalize to longer problems.
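To make the in-context scratchpad-prompting setup concrete, below is a minimal sketch of how such a prompt can be assembled: short exemplars whose solutions spell out intermediate steps (the scratchpad) are placed in context, and a longer held-out instance is appended as the query. The toy parity task, the prompt format, and the helper names (`scratchpad_example`, `build_prompt`) are illustrative assumptions, not the exact tasks or formats used in the paper.

```python
# Sketch of scratchpad-style few-shot prompting for a toy parity task.
# Assumption: the downstream model is queried with this text elsewhere;
# only the prompt construction is shown here.

def scratchpad_example(bits: list[int]) -> str:
    """Render one few-shot exemplar: intermediate running-parity steps
    (the scratchpad) are written out before the final answer."""
    lines = [f"Q: What is the parity of {' '.join(map(str, bits))}?"]
    parity = 0
    for i, b in enumerate(bits):
        parity ^= b
        lines.append(f"Step {i + 1}: running parity after {b} is {parity}")
    lines.append(f"A: {parity}")
    return "\n".join(lines)

def build_prompt(exemplars: list[list[int]], query: list[int]) -> str:
    """Concatenate short in-context exemplars with scratchpads, then append
    a longer query instance; the model must generalize to the new length."""
    shots = "\n\n".join(scratchpad_example(bits) for bits in exemplars)
    question = f"Q: What is the parity of {' '.join(map(str, query))}?"
    return f"{shots}\n\n{question}\n"

# Short (length-3) exemplars in context, longer (length-6) query held out.
print(build_prompt([[1, 0, 1], [0, 1, 1]], [1, 1, 0, 1, 0, 1]))
```

The design point the sketch illustrates is that length generalization is probed purely at inference time: no weights are updated, and the only training signal the model receives about the task is the handful of short, step-annotated exemplars in its context window.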