In-context learning (ICL) enables large language models (LLMs) to perform new tasks by prompting them with a sequence of training examples. However, ICL is very sensitive to the choice of training examples: randomly sampling examples from a training set leads to high variance in performance. In this paper, we show that curating a carefully chosen subset of training data greatly stabilizes ICL performance. We propose two methods to choose training subsets, both of which score training examples individually and then select the highest-scoring ones. CondAcc scores a training example by its average ICL accuracy when combined with random training examples, while Datamodels learns a linear proxy model that estimates how the presence of each training example influences LLM accuracy. On average, CondAcc and Datamodels outperform sampling from the entire training set by 7.7% and 6.3%, respectively, across five tasks and two LLMs. Our analysis shows that stable subset examples are no more diverse than average, and are not outliers in terms of sequence length and perplexity.
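To make the two scoring rules concrete, the sketch below shows one plausible way to compute them from a log of random k-shot prompts and their dev-set accuracies. The `eval_accuracy` stub, the variable names, and the use of scikit-learn's `LinearRegression` for the Datamodels proxy are illustrative assumptions, not the authors' exact implementation.

```python
import random
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in: run the LLM on a dev set with a prompt built from
# the given training-example indices and return its accuracy.
def eval_accuracy(prompt_indices):
    raise NotImplementedError("plug in an LLM evaluation here")

def collect_prompts(num_train, k, num_prompts, seed=0):
    """Sample random k-shot prompts and record their dev accuracy."""
    rng = random.Random(seed)
    prompts, accs = [], []
    for _ in range(num_prompts):
        idx = rng.sample(range(num_train), k)
        prompts.append(idx)
        accs.append(eval_accuracy(idx))
    return prompts, np.array(accs)

def condacc_scores(prompts, accs, num_train):
    """CondAcc-style score: average accuracy of the prompts that contain
    example i, relative to the overall mean accuracy."""
    scores = np.zeros(num_train)
    for i in range(num_train):
        mask = np.array([i in p for p in prompts])
        scores[i] = accs[mask].mean() - accs.mean() if mask.any() else 0.0
    return scores

def datamodel_scores(prompts, accs, num_train):
    """Datamodels-style score: fit a linear model from presence indicators
    to prompt accuracy; each learned weight estimates an example's influence."""
    X = np.zeros((len(prompts), num_train))
    for row, p in enumerate(prompts):
        X[row, p] = 1.0
    return LinearRegression().fit(X, accs).coef_

def top_k_subset(scores, k):
    """Select the k highest-scoring examples as the curated subset."""
    return np.argsort(scores)[::-1][:k].tolist()
```

Under these assumptions, both methods reduce to scoring each training example from the same log of random prompts and keeping the top scorers; they differ only in whether the score is a conditional average (CondAcc) or a learned linear weight (Datamodels).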