Large language models encode surprisingly broad knowledge about the world into their parameters. However, the knowledge in static language models can fall out of date, limiting their effective "shelf life." While online fine-tuning can reduce this degradation, we find that fine-tuning on a stream of documents using standard optimizers such as Adam leads to a disappointingly low level of information uptake. We hypothesize that online fine-tuning does not sufficiently 'attend' to important information. That is, the gradient signal from important tokens representing factual information is drowned out by the gradient from inherently noisy tokens, suggesting a dynamic, context-aware learning rate may be beneficial. To test this hypothesis, we meta-train a small, autoregressive model to reweight the language modeling loss for each token during online fine-tuning, with the objective of maximizing the out-of-date base language model's ability to answer questions about a document after a single weighted gradient step. We call this approach Context-aware Meta-learned Loss Scaling (CaMeLS). Across three different distributions of documents, our experiments find that fine-tuning on streams of thousands of documents with CaMeLS substantially improves knowledge retention compared to standard online fine-tuning. Finally, we find that the meta-learned weights are general, and that a single reweighting model can be used to enhance the online adaptation of many LMs.
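To make the core mechanic concrete, the following is a minimal sketch of the weighted update the abstract describes: a small model assigns each token an importance weight, and the base LM's per-token language modeling loss is reweighted before a single gradient step. It assumes PyTorch and a Hugging Face-style causal LM returning `.logits`; the `weight_model` interface and the weight normalization are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def weighted_finetune_step(base_lm, weight_model, optimizer, input_ids, attention_mask):
    """One online fine-tuning step with per-token loss reweighting.

    base_lm:      out-of-date causal LM being adapted (HF-style, assumed)
    weight_model: small autoregressive model emitting one weight per token
                  (hypothetical interface: (batch, seq_len) -> (batch, seq_len))
    """
    # Per-token importance weights from the meta-learned model, which is
    # frozen at adaptation time.
    with torch.no_grad():
        token_weights = weight_model(input_ids)  # (batch, seq_len)

    # Standard causal-LM shift: predict token t+1 from tokens <= t.
    logits = base_lm(input_ids=input_ids, attention_mask=attention_mask).logits
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_weights = token_weights[:, 1:] * attention_mask[:, 1:]

    # Per-token cross-entropy, reweighted so that informative factual tokens
    # dominate the gradient instead of being drowned out by noisy tokens.
    per_token_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view_as(shift_labels)
    loss = (shift_weights * per_token_loss).sum() / shift_weights.sum().clamp(min=1e-8)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

During meta-training, the outer objective stated above (the base LM's question-answering loss after this single weighted step) would be backpropagated through the update to train the weighting model; the sketch shows only the inner, deployment-time step.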