Large language models demonstrate an emergent ability to learn a new task from a small number of input-output demonstrations, referred to as in-context few-shot learning. However, recent work shows that in such settings, models mainly learn to mimic the distribution of the new task rather than its underlying mechanics. We argue that the commonly used evaluation setting for few-shot models, which draws in-context demonstrations at random, cannot disentangle a model's ability to learn new skills from demonstrations, since most randomly selected demonstrations are not informative for the prediction beyond exposing the new task's input and output distributions. We therefore introduce an evaluation technique that isolates few-shot learners' gain from in-context learning: we select demonstrations that share a specific, informative concept with the predicted sample and compare the resulting performance against that reached with mostly non-informative demonstrations. We find that, regardless of model size, existing few-shot learners are unable to benefit from observing such informative concepts in demonstrations. We also find that this ability is not obtained trivially by exposing the model to informative demonstrations during training, leaving the challenge of training true in-context learners open.
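To make the proposed evaluation concrete, below is a minimal sketch of the protocol in Python. It assumes each evaluation sample carries a set of concept annotations under a `concepts` key; the names `pick_informative_demos`, `pick_random_demos`, and `model_predict` are hypothetical, and `model_predict(demos, x)` is assumed to format the demonstrations into a prompt and return the model's answer.

```python
import random

def pick_informative_demos(test_sample, pool, k):
    """Informative condition: pick k demonstrations that share at
    least one annotated concept with the predicted sample."""
    shared = [d for d in pool
              if d is not test_sample and d["concepts"] & test_sample["concepts"]]
    return random.sample(shared, min(k, len(shared)))

def pick_random_demos(test_sample, pool, k):
    """Baseline condition: pick k demonstrations uniformly at random;
    these mostly expose only the task's input/output distribution."""
    candidates = [d for d in pool if d is not test_sample]
    return random.sample(candidates, k)

def few_shot_accuracy(model_predict, test_set, pool, picker, k=3):
    """Accuracy of a few-shot learner under a given
    demonstration-selection strategy."""
    correct = 0
    for sample in test_set:
        demos = picker(sample, pool, k)
        if model_predict(demos, sample["input"]) == sample["output"]:
            correct += 1
    return correct / len(test_set)
```

Under this sketch, the quantity of interest is the gap `few_shot_accuracy(..., picker=pick_informative_demos) - few_shot_accuracy(..., picker=pick_random_demos)`: a learner that actually exploits the shared concept should show a positive gap, whereas one that merely mimics the task distribution should not.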