Chain-of-Thought (CoT) prompting can effectively elicit complex multi-step reasoning from Large Language Models~(LLMs). For example, by simply adding the CoT instruction ``Let's think step-by-step'' to each input query of the MultiArith dataset, GPT-3's accuracy improves from 17.7\% to 78.7\%. However, it is not clear whether CoT remains effective on more recent instruction-finetuned (IFT) LLMs such as ChatGPT. Surprisingly, on ChatGPT, CoT is no longer effective for certain tasks such as arithmetic reasoning, while it remains effective on other reasoning tasks. Moreover, on the former tasks, ChatGPT usually achieves the best performance and generates CoT even without being instructed to do so. Hence, it is plausible that ChatGPT has already been trained on these tasks with the CoT instruction and has memorized it, so it implicitly follows the instruction when applied to the same queries, even when the instruction is not given. Our analysis reflects a potential risk of overfitting/bias toward instructions introduced in IFT, which is becoming increasingly common in LLM training. In addition, it indicates possible leakage of the pretraining recipe, e.g., one can verify whether a dataset and instruction were used in training ChatGPT. Our experiments report new baseline results of ChatGPT on a variety of reasoning tasks and shed new light on LLM profiling, instruction memorization, and pretraining dataset leakage.