Generating step-by-step "chain-of-thought" rationales improves language model performance on complex reasoning tasks like mathematics or commonsense question-answering. However, inducing language model rationale generation currently requires either constructing massive rationale datasets or sacrificing accuracy by using only few-shot inference. We propose a technique to iteratively leverage a small number of rationale examples and a large dataset without rationales, to bootstrap the ability to perform successively more complex reasoning. This technique, the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers, and performs comparably to fine-tuning a 30$\times$ larger state-of-the-art language model on CommonsenseQA. Thus, STaR lets a model improve itself by learning from its own generated reasoning.
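
The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate_rationale` and the commented-out `fine_tune` call are hypothetical stand-ins for the actual language model sampling and fine-tuning steps.

```python
def generate_rationale(model, item, hint=None):
    """Hypothetical stand-in for few-shot rationale sampling.

    With `hint` (the correct answer), the model "rationalizes" toward
    that answer; without it, it produces its own guess.
    """
    answer = hint if hint is not None else model(item["question"])
    return f"rationale for {item['question']}", answer

def star_loop(model, dataset, iterations=3):
    """One STaR-style outer loop over a dataset without rationales."""
    training_set = []
    for _ in range(iterations):
        training_set = []
        for item in dataset:
            # Step 1: attempt the question with few-shot rationale generation.
            rationale, answer = generate_rationale(model, item)
            if answer != item["answer"]:
                # Step 2 (rationalization): retry, conditioning on the
                # correct answer so the model can work backward to a rationale.
                rationale, answer = generate_rationale(model, item, hint=item["answer"])
            # Step 3: keep only rationales that yielded the correct answer.
            if answer == item["answer"]:
                training_set.append((item["question"], rationale, answer))
        # Step 4: fine-tune (from the original model each iteration) and repeat.
        # model = fine_tune(base_model, training_set)  # hypothetical hook
    return training_set
```

Note that rationalization is what keeps hard questions in the training set: without the hinted retry, examples the model cannot yet solve would never contribute training signal.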