Chain-of-Thought (CoT) prompting can effectively elicit complex multi-step reasoning from Large Language Models~(LLMs). For example, simply adding the CoT instruction ``Let's think step-by-step'' to each input query of the MultiArith dataset improves GPT-3's accuracy from 17.7\% to 78.7\%. However, it is unclear whether CoT remains effective on more recent instruction-finetuned (IFT) LLMs such as ChatGPT. Surprisingly, on ChatGPT, CoT is no longer effective for certain tasks such as arithmetic reasoning, while it remains effective on other reasoning tasks. Moreover, on the former tasks, ChatGPT usually achieves the best performance and generates CoT even without being instructed to do so. Hence, it is plausible that ChatGPT has already been trained on these tasks with CoT and has memorized the instruction, so it implicitly follows the instruction when applied to the same queries, even when no CoT prompt is given. Our analysis reflects a potential risk of overfitting/bias toward instructions introduced by IFT, which is becoming increasingly common in training LLMs. It also indicates possible leakage of the pretraining recipe, e.g., one can verify whether a dataset and instruction were used in training ChatGPT. Our experiments report new baseline results of ChatGPT on a variety of reasoning tasks and shed novel insights into LLM profiling, instruction memorization, and pretraining dataset leakage.
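To make the zero-shot CoT setup above concrete, the sketch below shows how the trigger phrase is appended to an input query before it is sent to a model. This is a minimal illustration only: the \texttt{query\_llm} helper and the sample question are assumptions for exposition, not the evaluation code used in this work.
\begin{verbatim}
# Minimal sketch of zero-shot CoT prompting.
# Assumptions: `query_llm` is a hypothetical placeholder for any LLM
# completion call, and the sample question is merely MultiArith-style.

COT_TRIGGER = "Let's think step by step."

def build_prompt(question: str, use_cot: bool) -> str:
    """Return the plain query, or the query with the CoT trigger appended."""
    if use_cot:
        return f"Q: {question}\nA: {COT_TRIGGER}"
    return f"Q: {question}\nA:"

def query_llm(prompt: str) -> str:
    # Hypothetical placeholder: substitute an actual model or API call here.
    raise NotImplementedError

if __name__ == "__main__":
    # Illustrative MultiArith-style arithmetic word problem.
    question = ("Wendy uploaded 45 pictures to Facebook. She put 27 pictures "
                "into one album and the rest into 9 albums. How many pictures "
                "were in each album?")
    print(build_prompt(question, use_cot=True))
\end{verbatim}
With \texttt{use\_cot=False}, the same query is sent without the trigger, which corresponds to the no-CoT baseline discussed above.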