Current large language models can perform reasonably well on complex tasks that require step-by-step reasoning with few-shot learning. Are these models applying reasoning skills learned during pre-training to reason outside their training context, or are they simply memorizing their training corpus at a finer granularity and learning to better understand their context? To tease apart these possibilities, we introduce ALERT, a benchmark and suite of analyses for assessing language models' reasoning ability by comparing pre-trained and finetuned models on complex tasks that require reasoning skills to solve. ALERT provides a test bed to assess any language model on fine-grained reasoning skills; it spans over 20 datasets and covers 10 different reasoning skills. We leverage ALERT to further investigate the role of finetuning. Through extensive empirical analysis, we find that language models acquire more reasoning skills, such as textual entailment, abductive reasoning, and analogical reasoning, during the finetuning stage than during pretraining. We also find that finetuned language models tend to overfit to the prompt template, which hurts model robustness and causes generalization problems.