Large language models (LLMs) trained with the next-token-prediction objective, such as GPT-3 and PaLM, have revolutionized natural language processing in recent years by showing impressive zero-shot and few-shot capabilities across a wide range of tasks. In this work, we propose a simple technique that significantly boosts the performance of LLMs without adding computational cost. Our key observation is that, by performing the next-token-prediction task with randomly selected past tokens masked out, we can improve the quality of the learned representations for downstream language understanding tasks. We hypothesize that randomly masking past tokens prevents over-attending to recent tokens and encourages attention to tokens in the distant past. We find that our method, Forgetful Causal Masking (FCM), significantly improves both few-shot and finetuning performance of PaLM. We further consider a simple extension, T-FCM, which introduces bidirectional context to causal language models without altering the sequence order, and further improves finetuning performance.
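To make the masking idea concrete, below is a minimal PyTorch sketch of how a causal attention mask could be modified so that a random subset of past tokens is hidden from later positions. The fixed `mask_ratio`, the per-sequence sampling of dropped positions, and the function name are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def fcm_attention_mask(seq_len: int, mask_ratio: float = 0.15) -> torch.Tensor:
    """Causal attention mask with randomly "forgotten" past tokens (illustrative sketch).

    Returns a boolean [seq_len, seq_len] mask where True means the query
    position (row) is allowed to attend to the key position (column).
    """
    # Standard causal mask: query i may attend to keys j <= i.
    causal = torch.tril(torch.ones(seq_len, seq_len)).bool()
    # Randomly select past key positions to drop; dropped positions become
    # invisible to every later query for this sequence.
    forget = torch.rand(seq_len) < mask_ratio
    mask = causal & ~forget.unsqueeze(0)
    # Each token always remains visible to itself.
    mask |= torch.eye(seq_len, dtype=torch.bool)
    return mask
```

In a training step, such a mask would replace the usual lower-triangular mask inside self-attention, so the next-token loss is computed while some past context is randomly withheld.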