Pretrained Language Models (LMs) have demonstrated the ability to perform numerical reasoning by extrapolating from a few examples in few-shot settings. However, the extent to which this extrapolation relies on robust reasoning is unclear. In this paper, we investigate how well these models reason with terms that are less frequent in the pretraining data. In particular, we examine the correlations between the model performance on test instances and the frequency of terms from those instances in the pretraining data. We measure the strength of this correlation for a number of GPT-based language models (pretrained on the Pile dataset) on various numerical deduction tasks (e.g., arithmetic and unit conversion). Our results consistently demonstrate that models are more accurate on instances whose terms are more prevalent in the pretraining data, in some cases by more than $70\%$ (absolute) on the top 10\% most frequent terms compared to the bottom 10\%. Overall, although LMs exhibit strong performance on few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond the pretraining data, and we encourage researchers to take the pretraining data into account when interpreting evaluation results.
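To make the analysis described above concrete, the following is a minimal sketch of the kind of frequency-vs-accuracy comparison the abstract refers to: a rank correlation between per-instance term frequency and correctness, plus the accuracy gap between the top and bottom 10\% of instances by frequency. The data here are synthetic placeholders (not results or code from the paper), and all variable names are illustrative assumptions.

```python
"""Sketch of a frequency-vs-accuracy analysis on synthetic placeholder data."""
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical per-instance data: pretraining frequency of each test
# instance's terms, and whether the model answered correctly (1 / 0).
# Both arrays are synthetic stand-ins for illustration only.
freq = rng.lognormal(mean=8.0, sigma=2.0, size=1000)
p_correct = 1.0 / (1.0 + np.exp(-(np.log(freq) - 8.0)))  # synthetic: accuracy rises with frequency
correct = rng.binomial(1, p_correct)

# Rank correlation between term frequency and per-instance correctness.
rho, pval = spearmanr(freq, correct)

# Accuracy gap between the most- and least-frequent 10% of instances.
lo_cut, hi_cut = np.quantile(freq, [0.10, 0.90])
acc_bottom = correct[freq <= lo_cut].mean()
acc_top = correct[freq >= hi_cut].mean()

print(f"Spearman rho = {rho:.3f} (p = {pval:.2g})")
print(f"Top-10% accuracy    = {acc_top:.3f}")
print(f"Bottom-10% accuracy = {acc_bottom:.3f}")
print(f"Absolute gap        = {acc_top - acc_bottom:.3f}")
```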