Large language models (LLMs) can perform complex reasoning via in-context learning (ICL) when provided with a few input-output demonstrations (demos), and become even more powerful when the demos include intermediate reasoning steps ("chain of thought (CoT)"). Is it necessary to use multiple demos in ICL? In this paper, we study ICL using fewer demos per test query on the tasks in~\cite{wei2022chain}. Surprisingly, we do not observe significant degradation when using only one randomly chosen demo. To study this phenomenon, for each test query we categorize demos into "correct demos", which lead to the correct answer, and "wrong demos", which result in wrong answers. Our analysis reveals an inherent bias in these widely studied datasets: most demos are correct for a majority of test queries, which explains the good performance of a single random demo. Moreover, ICL (with and without CoT) using only one correct demo significantly outperforms the all-demo ICL adopted by most previous works, indicating a weakness of LLMs in identifying the correct demo(s) for a given query, which is difficult to evaluate on these biased datasets. Furthermore, we observe a counterintuitive behavior of multi-demo ICL: its accuracy degrades (improves) when given more correct (wrong) demos. This implies that ICL can be easily misguided by interference among demos and their spurious correlations. Our analyses highlight several fundamental challenges that need to be addressed in LLM training, ICL, and benchmark design.
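To make the per-query demo categorization concrete, the following is a minimal sketch (not the authors' released code) of how each demo could be labeled "correct" or "wrong" for a given test query via one-demo ICL. The functions \texttt{query\_llm} and \texttt{extract\_answer}, and the data containers \texttt{demos} and \texttt{test\_set}, are hypothetical placeholders introduced only for illustration.

\begin{verbatim}
# Minimal sketch, assuming a hypothetical query_llm(prompt) -> str API
# and an extract_answer(text) -> str parser for the final answer.
def categorize_demos(demos, test_set, query_llm, extract_answer):
    """For each test query, split demos into 'correct' and 'wrong' sets,
    based on whether one-demo ICL with that demo yields the gold answer."""
    results = []
    for question, gold_answer in test_set:
        correct, wrong = [], []
        for demo in demos:
            # One-demo ICL prompt: a single (input, CoT, output)
            # demonstration followed by the test question.
            prompt = f"{demo}\n\nQ: {question}\nA:"
            prediction = extract_answer(query_llm(prompt))
            (correct if prediction == gold_answer else wrong).append(demo)
        results.append({"question": question,
                        "correct": correct,
                        "wrong": wrong})
    return results
\end{verbatim}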