Language models significantly benefit from context tokens, such as prompts or scratchpads. They perform better when prompted with informative instructions, and they acquire new reasoning capabilities by generating a scratchpad before predicting the final answer. However, they do not \textit{internalize} these performance gains, which disappear once the context tokens are removed. Our work proposes to apply context distillation so that a language model can improve itself by internalizing these gains. Concretely, given a synthetic unlabeled input for the target task, we condition the model on ``[instructions] + [task-input]'' to predict ``[scratchpad] + [final answer]''; then we fine-tune the same model to predict its own ``[final answer]'' conditioned on the ``[task-input]'', without seeing the ``[instructions]'' or using the ``[scratchpad]''. We show that context distillation is a general method for training language models, and it can effectively internalize three types of training signals. First, it can internalize abstract task instructions and explanations, so we can iteratively update the model parameters with new instructions and overwrite old ones. Second, it can internalize step-by-step reasoning for complex tasks (e.g., 8-digit addition), and this newly acquired capability is useful for other downstream tasks. Finally, it can internalize concrete training examples, outperforming direct learning with gradient descent by 9\% on the SPIDER Text-to-SQL dataset; furthermore, combining context distillation operations can internalize more training examples than the context window size allows.
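The two-stage procedure described above can be illustrated with the following minimal sketch. This is a hedged illustration, not the authors' released code: it assumes a causal language model served through the HuggingFace \texttt{transformers} API, and the instruction and task-input strings are hypothetical placeholders chosen only to make the example concrete.

\begin{verbatim}
# Minimal sketch of one context-distillation step (assumptions: HuggingFace
# transformers, a small causal LM as a stand-in, placeholder prompt strings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper uses larger language models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

instructions = "Add the two numbers digit by digit, carrying when needed."
task_input = "What is 31415926 + 27182818?"

# Stage 1: condition on [instructions] + [task-input] and let the model
# generate [scratchpad] + [final answer].
prompt = instructions + "\n" + task_input + "\n"
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
generated = model.generate(prompt_ids, max_new_tokens=128, do_sample=False)
completion = tokenizer.decode(generated[0, prompt_ids.shape[1]:],
                              skip_special_tokens=True)
final_answer = completion.strip().splitlines()[-1]  # assume last line is the answer

# Stage 2: fine-tune the same model to predict its own [final answer]
# conditioned only on [task-input], without instructions or scratchpad.
student_text = task_input + "\n" + final_answer
input_ids = tokenizer(student_text, return_tensors="pt").input_ids
labels = input_ids.clone()
prefix_len = tokenizer(task_input + "\n", return_tensors="pt").input_ids.shape[1]
labels[:, :prefix_len] = -100  # only the final-answer tokens contribute to the loss

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(input_ids, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
\end{verbatim}

In practice this step would be repeated over many synthetic task inputs, so the fine-tuned model learns to produce the final answer directly from the bare task input.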