Large language models (LLMs) can perform complex reasoning via in-context learning (ICL) when provided with a few input-output demonstrations (demos), and perform even better when the demos include intermediate reasoning steps ("chain of thought", CoT). Are multiple demos necessary for ICL? In this paper, we study ICL with fewer demos per test query on the tasks studied in~\cite{wei2022chain}. Surprisingly, we do not observe significant degradation when using only one randomly chosen demo. To investigate this phenomenon, we categorize, for each test query, the demos into "correct demos" that lead to the correct answer and "wrong demos" that lead to wrong answers. Our analysis reveals an inherent bias in these widely studied datasets: most demos are correct for a majority of test queries, which explains the strong performance of ICL with one random demo. Moreover, ICL (both with and without CoT) using only one correct demo significantly outperforms the all-demo ICL adopted in most previous work, indicating a weakness of LLMs in identifying correct demo(s) for input queries, a weakness that is difficult to evaluate on these biased datasets. Furthermore, we observe a counterintuitive behavior of multi-demo ICL: its accuracy degrades (improves) when given more correct (wrong) demos. This implies that ICL can be easily misguided by interference among demos and by their spurious correlations. Our analyses highlight several fundamental challenges that need to be addressed in LLM training, ICL, and benchmark design.
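The per-query demo categorization described above can be illustrated with a minimal sketch: a demo counts as "correct" for a test query if one-shot ICL with that single demo yields the gold answer, and "wrong" otherwise. This is not the authors' code; the function \texttt{query\_llm} is a hypothetical placeholder for any LLM completion call, and the exact-match comparison is an assumed simplification of answer checking.

\begin{verbatim}
from typing import Callable, Dict, List, Tuple

def categorize_demos(
    demos: List[Tuple[str, str]],     # (demo question, demo answer/CoT) pairs
    query: str,                       # test question
    gold: str,                        # gold answer for the test question
    query_llm: Callable[[str], str],  # placeholder: prompt -> model output
) -> Dict[str, List[Tuple[str, str]]]:
    """Split demos into 'correct' and 'wrong' for a single test query."""
    result: Dict[str, List[Tuple[str, str]]] = {"correct": [], "wrong": []}
    for q, a in demos:
        # One-shot prompt: a single demo followed by the test query.
        prompt = f"Q: {q}\nA: {a}\n\nQ: {query}\nA:"
        prediction = query_llm(prompt).strip()
        # Assumed simplification: exact string match against the gold answer.
        if prediction == gold.strip():
            result["correct"].append((q, a))
        else:
            result["wrong"].append((q, a))
    return result
\end{verbatim}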